

Metropolitan State University
ECON 497: Research and Forecasting
Lecture Notes 6
Specification: Choosing the Independent Variables
Studenmund Chapter 6

Before we start, let's get one thing straight. You should use theories about the relationship you are estimating to determine which explanatory or independent variables to include. Everything else in this lecture is of less importance than this basic rule.

Specification Error

There are three things to be done in specifying an equation to be estimated:

1. Choosing the correct independent variables (which we'll discuss here)
2. Choosing the correct functional form (which is in chapter 7)
3. Choosing the correct form of the error term (which is discussed in later chapters)

Getting any of these wrong will result in a specification error, which can seriously affect the characteristics of the estimators of the regression coefficients. A particularly nasty aspect of this is that it is often difficult or impossible to tell if a model fails to satisfy these steps. Now, some particular problems and their implications.

Omitted Variables

If there is a variable that should be included in the equation but is left out, there is an omitted variable problem. More specifically, the estimators of the regression coefficients may be biased, meaning that their expected values may not be equal to the actual values. To put this more clearly, consider the relationship:

$$Y_i = \beta_0 + \beta_1 X_{1i} + \beta_2 X_{2i} + \epsilon_i$$

If this is the model which is estimated and all the classical assumptions are satisfied, we can say that

$$E[\hat{\beta}_0] = \beta_0, \qquad E[\hat{\beta}_1] = \beta_1, \qquad E[\hat{\beta}_2] = \beta_2$$

That is, the expected value of each estimator is equal to the value it is meant to estimate. If, however, you don't think to include one of these explanatory variables, or can't get data on it, the equation you wind up estimating might be:

$$Y_i = \beta_0 + \beta_1 X_{1i} + \epsilon_i$$
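To make the two specifications above concrete, here is a minimal simulation sketch. It assumes hypothetical values that are not from the lecture (true coefficients of 2.0 and 3.0, a correlation of 0.6 between the regressors, standard normal errors), generates data from the full model, and then estimates the equation both with and without $X_2$. The function name `simulate_omitted_variable` is illustrative only.

```python
import numpy as np

def simulate_omitted_variable(beta1=2.0, beta2=3.0, rho=0.6,
                              n=500, reps=400, seed=0):
    """Average estimate of beta1 with X2 included and with X2 omitted.

    All parameter values are assumptions chosen for illustration.
    """
    rng = np.random.default_rng(seed)
    cov = np.array([[1.0, rho], [rho, 1.0]])  # X1 and X2 correlated by rho
    full, short = [], []
    for _ in range(reps):
        x = rng.multivariate_normal([0.0, 0.0], cov, size=n)
        x1, x2 = x[:, 0], x[:, 1]
        # Data come from the full model: Y = 1 + beta1*X1 + beta2*X2 + error
        y = 1.0 + beta1 * x1 + beta2 * x2 + rng.normal(size=n)
        ones = np.ones(n)
        # OLS on the full model (constant, X1, X2)
        b_full = np.linalg.lstsq(np.column_stack([ones, x1, x2]), y, rcond=None)[0]
        # OLS on the short model (constant, X1), with X2 left out
        b_short = np.linalg.lstsq(np.column_stack([ones, x1]), y, rcond=None)[0]
        full.append(b_full[1])
        short.append(b_short[1])
    return float(np.mean(full)), float(np.mean(short))

full_mean, short_mean = simulate_omitted_variable()
print(f"mean beta1-hat, X2 included: {full_mean:.3f}")
print(f"mean beta1-hat, X2 omitted:  {short_mean:.3f}")
```

Under these assumptions, the average estimate from the full regression should sit close to the true value of 2.0, while the average from the short regression should sit noticeably above it. Setting `rho=0.0` is a quick way to see what happens when the two regressors are uncorrelated.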

In this case, you will attempt to estimate the coefficient $\beta_1$, but unfortunately the estimator will be biased. That is:

$$E[\hat{\beta}_1] \neq \beta_1$$

Specifically, as presented in Studenmund, the expected value of $\hat{\beta}_1$ is equal to $\beta_1$ plus $\beta_2$ times a function of the correlation between $X_1$ and $X_2$ ($\rho_{12}$):

$$E[\hat{\beta}_1] = \beta_1 + \beta_2 \cdot f(\rho_{12}) \qquad (6.4)$$

What does this mean?

1. If there is no correlation between the included and the excluded variable, then the estimator of the coefficient of the included variable will be unbiased. That is, if $\rho_{12} = 0$, then $E[\hat{\beta}_1] = \beta_1$.
2. If there is some correlation between $X_1$ and $X_2$, the sign of the bias on $\hat{\beta}_1$ will depend on the sign of the coefficient $\beta_2$ (which you can make a guess about from your theory about the relationship between the dependent and independent variables) and the sign of the correlation coefficient.

If a relevant explanatory variable is excluded, what will be the impact on the estimated coefficients of the included variables?

Effect of Excluded Variables on Estimated Coefficients of Included Variables

| | Positive correlation between included and excluded variable | Negative correlation between included and excluded variable |
|---|---|---|
| Excluded variable has a positive effect | Estimated coefficient will be positively biased | Estimated coefficient will be negatively biased |
| Excluded variable has a negative effect | Estimated coefficient will be negatively biased | Estimated coefficient will be positively biased |

Some examples provided by students.

A further problem with omitting explanatory variables is that omissions will make the reported standard errors of the estimated coefficients (such as the standard error of $\hat{\beta}_1$) smaller than they should be. Now, we only care about the standard errors of the estimated coefficients because they are used to calculate the t-statistics for the estimated coefficients. To state this clearly:

$$t = \frac{\hat{\beta}_1}{SE(\hat{\beta}_1)}$$

If a variable (such as $X_2$) which should be included is omitted, the numerator of the t-stat will be biased and the denominator will be too small. The net effect on the t-statistic is ambiguous.

EX: Consider the example from Studenmund. The coefficient estimates and t-statistics are missing from this copy; the standard errors appear in parentheses.

With PB included: $\hat{Y}_t$ regressed on $PC_t$, $PB_t$, and $LYD_t$, with standard errors (0.13), (0.08), and (2.06).

With PB excluded: $\hat{Y}_t$ regressed on $PC_t$ and $LYD_t$, with standard errors (0.12) and (0.60).

When the variable PB is improperly excluded from the equation, the standard errors of the estimated coefficients on PC and LYD become smaller. What effect does this have on the t-statistics? The t-stat on the estimated coefficient for PC gets smaller (in absolute terms) because the magnitude of the estimate of the coefficient gets smaller. The t-stat on the estimated coefficient for LYD gets larger because the estimate gets larger and the standard error is smaller.

So, when a variable is omitted, the t-stats become unreliable. However, if you know whether the bias is positive or negative, you can make some predictions about how the correct estimates compare to yours.

Return to student examples.

Correcting for Omitted Variables

The easiest answer is to say that if you know something is omitted, you should simply include it. If that were so easy, though, we wouldn't be discussing this. If there is an independent variable you know should be included but you can't get data for it, you can at least estimate the sign of the bias on the estimates of the other coefficients. You will still have to admit that the estimates are biased, but knowing the sign of the bias lets you say whether they are likely to be too large or too small.

Irrelevant Variables

If an independent variable which shouldn't be included in an equation is included, there may be problems with the estimated equation. Fortunately, these problems are usually not serious. Saying that an explanatory variable doesn't belong in an equation is roughly equivalent to saying that the coefficient on that variable is zero. That is, the variable has no effect on the dependent variable.
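A minimal sketch of what a zero coefficient implies in practice: the simulation below generates data in which $X_2$ truly has no effect on $Y$, then estimates the coefficient on $X_1$ both without and with $X_2$ in the equation. The parameter values (a true coefficient of 2.0, a correlation of 0.8 between the regressors, 100 observations) and the function name `simulate_irrelevant_variable` are assumptions made for illustration, not part of the lecture.

```python
import numpy as np

def simulate_irrelevant_variable(beta1=2.0, rho=0.8, n=100, reps=1000, seed=1):
    """Mean and spread of beta1-hat with and without an irrelevant X2.

    All parameter values are assumptions chosen for illustration.
    """
    rng = np.random.default_rng(seed)
    cov = np.array([[1.0, rho], [rho, 1.0]])  # X2 is correlated with X1
    correct, padded = [], []
    for _ in range(reps):
        x = rng.multivariate_normal([0.0, 0.0], cov, size=n)
        x1, x2 = x[:, 0], x[:, 1]
        # True model: X2 does not appear, so its coefficient is zero
        y = 1.0 + beta1 * x1 + rng.normal(size=n)
        ones = np.ones(n)
        # Correct specification: constant and X1 only
        b_correct = np.linalg.lstsq(np.column_stack([ones, x1]), y, rcond=None)[0]
        # Specification padded with the irrelevant X2
        b_padded = np.linalg.lstsq(np.column_stack([ones, x1, x2]), y, rcond=None)[0]
        correct.append(b_correct[1])
        padded.append(b_padded[1])
    correct, padded = np.array(correct), np.array(padded)
    return {
        "mean_correct": float(correct.mean()),
        "mean_padded": float(padded.mean()),
        "sd_correct": float(correct.std(ddof=1)),
        "sd_padded": float(padded.std(ddof=1)),
    }

result = simulate_irrelevant_variable()
for name, value in result.items():
    print(f"{name}: {value:.3f}")
```

Under these assumptions, both averages should land near the true value of 2.0, but the estimates from the padded specification should be more spread out across replications. The cost of the irrelevant variable shows up in precision, not in the expected value.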

As Studenmund puts this (p. 180): If the true model is:

$$Y_i = \beta_0 + \beta_1 X_{1i} + \epsilon_i$$

but instead, you include $X_2$ and estimate:

$$Y_i = \beta_0 + \beta_1 X_{1i} + \beta_2 X_{2i} + \epsilon_i^{**}$$

then the error term from the second equation (the one you incorrectly estimate) will be

$$\epsilon_i^{**} = \epsilon_i - \beta_2 X_{2i}$$

However, if $X_2$ doesn't belong in the equation, then $\beta_2 = 0$, so

$$\epsilon_i^{**} = \epsilon_i$$

The problem is that if the irrelevant variable ($X_2$) is correlated with the relevant variable ($X_1$), then the standard error of the estimated coefficient on $X_1$ will be larger than it should be, leading to a smaller t-statistic. As a result, there may be some estimated coefficients that are significant but, because of the inclusion of the irrelevant variable, will appear insignificant. Again, including an irrelevant variable can make significant coefficients appear insignificant. This is similar to the problem resulting from multicollinearity.

These problems are nicely summarized by Table 6.2 in Studenmund:

| Effect on Coefficient Estimate | Omitted Variable | Irrelevant Variable |
|---|---|---|
| Bias | Yes, unless $\rho_{12} = 0$ | No |
| Variance of estimated coefficients | Decreases, unless $\rho_{12} = 0$ | Increases, unless $\rho_{12} = 0$ |
| t-stat effect | Ambiguous change, unless $\rho_{12} = 0$ | Decreases, unless $\rho_{12} = 0$ |

Four Important Specification Criteria

Studenmund offers four rules for determining if a variable should be added to a regression equation. The first is the most important; the others are supplemental. If you're considering adding a theoretically justifiable variable to your equation, do a regression without it in the equation and then with it in the equation. If these rules are satisfied, the variable should be included.

1. Theory: Is the variable's place in the equation unambiguous and theoretically sound?
2. t-test: Is the variable's estimated coefficient significant in the expected direction?
3. Adjusted $R^2$: Does the overall fit of the equation (adjusted for the degrees of freedom) improve when the variable is added to the equation?
4. Bias: Do other variables' coefficients change significantly when the variable is added to the equation?

Studenmund offers two examples of the application of these rules and you should read these examples.

One Additional Consideration

As you will all discover, data are not perfect. That is, some observations will likely be missing some values. This may be because people taking a survey decline to answer a particular question ("Is that your nose, or are you eating a banana?"), because governments don't collect information on a variable wanted for an international data set ("Generalissimo, how many political prisoners are we currently holding?") or because the people answering the question simply don't know the information.

Because it is impossible to do a regression using observations missing even one included variable, you may find yourself making a tradeoff between including all the explanatory variables you want and having very few observations, or sacrificing one or two variables and having plenty of observations. Under these conditions, doing the regression both ways and comparing the results may be the best solution, but missing values for some observations may be one reason to include, or more likely exclude, a variable from a regression.

Some Practices to Avoid and/or Understand

1. Data Mining: Data mining is the estimation of many (or all) possible regression equations with no regard for theoretical justification in a blind attempt to get the desired results. Remember, a level of significance of 5% in a hypothesis test means that even if an explanatory variable has no influence on a dependent variable, 5% of the time it will appear to have influence. If you try twenty different models which have no real explanatory power, the expected number of models which will appear to have significant explanatory power (at a 5% level of significance) is one, since 20 × 0.05 = 1.

2. Stepwise Regressions: This is the process of allowing a computer package to choose the explanatory variable which has the greatest explanatory power, then, having chosen the first, to choose from those remaining the second explanatory variable which adds the most explanatory power, and so on. There are actually procedures written into some statistical packages which do this automatically if asked. Computers are dumb machines and know not what they do, but the software packages have this feature because there are equally dumb researchers out there who want to use this procedure. If someone seriously presents results from a stepwise regression, you should taunt them about their lack of an underlying theory until they cry.

3. Sequential Specification Searches: This is the process of starting with the variables you know should be included and then trying others about which you are less sure. This isn't necessarily a bad idea, but there is always a concern about what reasons the researchers may have had for reporting the results that they did. If a number of specifications are tried, all of their results should be either presented or, at least, mentioned in the final report. If, for example, a large number of the secondary explanatory variables had little or no effect on the estimated coefficients and explanatory power, this could be discussed in a footnote or an appendix.

4. Relying on t-test Results: Problems with multicollinearity and omitted variable bias can make t-tests unreliable indicators of which variables should be included or excluded.

5. Scanning: As far as I can tell, this refers to data mining one data set to find a good specification and then estimating that model using a different data set. Please note that this requires two distinct data sets.

6. Sensitivity Analysis: This is the practice of estimating your preferred specification, determining which results are important, and then estimating some slight variations of the specification to see if the important results are preserved or if they disappear. If the important results persist across slight changes in the model, they are said to be robust to slight changes in the model. If they disappear when the model is changed slightly, they may be artificial products of a particular specification that do not accurately reflect a relationship in the data. Presentation of the results from several different specifications can clarify the robustness of important regression results. A fun question to ask someone presenting results is, "Are your important results robust to changes in model specification?"

An Example: Automobile Acceleration Times

One of the data sets from Studenmund deals with acceleration times (S) from 0 to 62 mph for various automobiles. Two of the explanatory variables are weight in pounds (P) and engine horsepower (H). As you can see below, there is a positive correlation between these two variables.

[Correlations table for T, E, P and H, reporting the Pearson correlation, Sig. (2-tailed) and N for each pair of variables. The numeric entries are missing from this copy. ** marks a correlation that is significant at the 0.01 level (2-tailed).]

What would be the impact of removing one of them from the regression? More specifically, what would be the impact on the coefficient of H if P were excluded from the regression?

H should have a negative coefficient while P has a positive coefficient. The two variables are positively correlated. So, if P is excluded, the estimated coefficient on H will be positively biased, and the magnitude of the negative coefficient on H should be reduced. That is, the negative estimated coefficient on H should move closer to zero.
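The sign reasoning above is a direct application of the earlier table on excluded variables, and it can be sketched as a small helper. The function name `omitted_variable_bias_sign` is hypothetical, and the inputs are only the signs discussed in the text (P has a positive effect on S, and P and H are positively correlated), not estimates from the data set.

```python
def omitted_variable_bias_sign(excluded_effect: float, correlation: float) -> int:
    """Sign of the bias on an included variable's estimated coefficient.

    excluded_effect: sign (or value) of the excluded variable's true coefficient
    correlation: sign (or value) of the correlation between the included
                 and the excluded variable
    Returns +1 (positive bias), -1 (negative bias) or 0 (no bias).
    """
    def sign(value: float) -> int:
        return (value > 0) - (value < 0)

    # The bias has the sign of (effect of excluded variable) x (correlation)
    return sign(excluded_effect) * sign(correlation)

# Excluding weight (P): positive effect on acceleration time,
# positively correlated with horsepower (H)
bias_on_h = omitted_variable_bias_sign(excluded_effect=+1, correlation=+1)
labels = {1: "positive", -1: "negative", 0: "none"}
print(f"Bias on the coefficient of H when P is excluded: {labels[bias_on_h]}")
```

A positive bias applied to a coefficient that theory says is negative pushes the estimate toward zero, which is the conclusion reached above.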

Regression with P included

[Model Summary: R = .842. Predictors: (Constant), H, T, E, P. Dependent variable: S. The R Square, Adjusted R Square and Std. Error of the Estimate entries are missing from this copy.]

[Coefficients table for (Constant), T, E, P and H, with columns for unstandardized coefficients (B, Std. Error), standardized coefficients (Beta), t and Sig. The numeric entries are missing from this copy.]

Regression with P excluded

[Model Summary: R = .839. Predictors: (Constant), H, T, E. Dependent variable: S. The R Square, Adjusted R Square and Std. Error of the Estimate entries are missing from this copy.]

[Coefficients table for (Constant), T, E and H, with columns for unstandardized coefficients (B, Std. Error), standardized coefficients (Beta), t and Sig. The numeric entries are missing from this copy.]
