Chapter 14: Omitted Explanatory Variables, Multicollinearity, and Irrelevant Explanatory Variables


Chapter 14 Outline

Review
o Unbiased Estimation Procedures
    Estimates and Random Variables
    Mean of the Estimate's Probability Distribution
    Variance of the Estimate's Probability Distribution
o Correlated and Independent (Uncorrelated) Variables
    Scatter Diagrams
    Correlation Coefficient
Omitted Explanatory Variables
o A Puzzle: Baseball Attendance
o Goal of Multiple Regression Analysis
o Omitted Explanatory Variables and Bias
o Resolving the Baseball Attendance Puzzle
o Omitted Variable Summary
Multicollinearity
o Perfectly Correlated Explanatory Variables
o Highly Correlated Explanatory Variables
o Earmarks of Multicollinearity
Irrelevant Explanatory Variables

Chapter 14 Prep Questions

1. Review the goal of multiple regression analysis. In words, explain what multiple regression analysis attempts to do.
2. Recall that the presence of a random variable brings forth both bad news and good news.
a. What is the bad news?
b. What is the good news?
3. Consider an estimate's probability distribution. Review the importance of its mean and variance:
a. Why is the mean of the probability distribution important? Explain.
b. Why is the variance of the probability distribution important? Explain.
4. Suppose that two variables are positively correlated.
a. In words, what does this mean?
b. What type of graph do we use to illustrate their correlation? What does the graph look like?
c. What can we say about their correlation coefficient?

d. When two variables are perfectly positively correlated, what will their correlation coefficient equal?
5. Suppose that two variables are independent (uncorrelated).
a. In words, what does this mean?
b. What type of graph do we use to illustrate their correlation? What does the graph look like?
c. What can we say about their correlation coefficient?

Baseball Data: Panel data of baseball statistics for the 588 American League games played during the summer of 1996.

Attendance t    Paid attendance for game t
DateDay t    Day of game t
DateMonth t    Month of game t
DateYear t    Year of game t
DayOfWeek t    Day of the week for game t (Sunday=0, Monday=1, etc.)
DH t    Designated hitter for game t (1 if DH permitted; 0 otherwise)
HomeGamesBehind t    Games behind of the home team before game t
HomeIncome t    Per capita income in home team's city for game t
HomeLosses t    Season losses of the home team before game t
HomeNetWins t    Net wins (wins less losses) of the home team before game t
HomeSalary t    Player salaries of the home team for game t (millions of dollars)
HomeWins t    Season wins of the home team before game t
PriceTicket t    Average price of tickets sold for game t's home team (dollars)
VisitGamesBehind t    Games behind of the visiting team before game t
VisitLosses t    Season losses of the visiting team before game t
VisitNetWins t    Net wins (wins less losses) of the visiting team before game t
VisitSalary t    Player salaries of the visiting team for game t (millions of dollars)
VisitWins t    Season wins of the visiting team before game t

6. Focus on the baseball data.
a. Consider the following simple model: Attendance t = β Const + β Price PriceTicket t + e t

Attendance depends only on the ticket price.
1) What does the economist's downward sloping demand curve theory suggest about the sign of the PriceTicket coefficient, β Price?
2) Use the ordinary least squares (OLS) estimation procedure to estimate the model's parameters. Interpret the regression results. [Link to MIT-ALSummer-1996.wf1 goes here.]
b. Consider a second model: Attendance t = β Const + β Price PriceTicket t + β HomeSalary HomeSalary t + e t
Attendance depends not only on the ticket price, but also on the salary of the home team.
1) Devise a theory explaining the effect that home team salary should have on attendance. What does your theory suggest about the sign of the HomeSalary coefficient, β HomeSalary?
2) Use the ordinary least squares (OLS) estimation procedure to estimate both of the model's coefficients. Interpret the regression results.
c. What do you observe about the estimates for the PriceTicket coefficients in the two models?
7. Again, focus on the baseball data and consider the following two variables:
Attendance t    Paid attendance at game t
PriceTicket t    Average ticket price in terms of dollars for game t
You can access these data by clicking the following link: [Link to MIT-ALSummer-1996.wf1 goes here.]
Generate a new variable, PriceCents, to express the price in terms of cents rather than dollars: PriceCents = 100 PriceTicket
a. What is the correlation coefficient for PriceTicket and PriceCents?
b. Consider the following model: Attendance t = β Const + β PriceTicket PriceTicket t + β PriceCents PriceCents t + e t
Run the regression to estimate the parameters of this model. You will get an unusual result. Explain this result by considering what multiple regression analysis attempts to do.

8. The following are excerpts from an article appearing in the New York Times on September 1, 2008:

Doubts Grow Over Flu Vaccine in Elderly
By Brenda Goodman
The influenza vaccine, which has been strongly recommended for people over 65 for more than four decades, is losing its reputation as an effective way to ward off the virus in the elderly. A growing number of immunologists and epidemiologists say the vaccine probably does not work very well for people over 70… The latest blow was a study in The Lancet last month that called into question much of the statistical evidence for the vaccine's effectiveness. The study found that people who were healthy and conscientious about staying well were the most likely to get an annual flu shot. … [Others] are less likely to get to their doctor's office or a clinic to receive the vaccine. … Dr. David K. Shay of the Centers for Disease Control and Prevention, a co-author of a commentary that accompanied Dr. Jackson's study, agreed that these measures of health were not incorporated into early estimations of the vaccine's effectiveness and could well have skewed the findings.

a. Does being healthy and conscientious about staying well increase or decrease the chances of getting the flu?
b. According to the article, are those who are healthy and conscientious about staying well more or less likely to get a flu shot?
c. The article alleges that previous studies did not incorporate health and conscientiousness in judging the effectiveness of flu shots. If the allegation is true, have previous studies overestimated or underestimated the effectiveness of flu shots?
d. Suppose that you were the director of your community's health department. You are considering whether or not to subsidize flu vaccines for the elderly. Would you find the previous studies useful? That is, would a study that did not incorporate health and conscientiousness in judging the effectiveness of flu shots help you decide if your department should spend its limited budget to subsidize flu vaccines? Explain.

Review

Unbiased Estimation Procedures

Estimates and Random Variables
Estimates are random variables. Consequently, there is both good news and bad news before the data are collected and the parameters are estimated:
Bad news: We cannot determine the numerical value of the estimate with certainty (even if we knew the actual value).
Good news: On the other hand, we can often describe the probability distribution of the estimate, telling us how likely it is for the estimate to equal each of its possible numerical values.

Mean (Center) of the Estimate's Probability Distribution
An unbiased estimation procedure does not systematically underestimate or overestimate the actual value; the mean (center) of the estimate's probability distribution equals the actual value. Applying the relative frequency interpretation of probability: when the experiment is repeated many, many times, the average of the numerical values of the estimates equals the actual value.

Figure 14.1: Probability Distribution of an Estimate (Unbiased Estimation Procedure)
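The relative frequency interpretation can be checked with a small simulation. A minimal sketch, assuming a hypothetical unbiased procedure, the sample mean estimating a made-up population mean of 10 (none of these numbers come from the chapter):

```python
import random

# Repeat the "experiment" many, many times: draw a sample, compute the
# sample mean (our estimate), and record it. For an unbiased procedure,
# the average of the estimates should settle at the actual value.
random.seed(1)

ACTUAL_MEAN = 10.0   # assumed actual value, for illustration only
REPETITIONS = 20000
SAMPLE_SIZE = 25

estimates = []
for _ in range(REPETITIONS):
    sample = [random.gauss(ACTUAL_MEAN, 5.0) for _ in range(SAMPLE_SIZE)]
    estimates.append(sum(sample) / SAMPLE_SIZE)

average_estimate = sum(estimates) / REPETITIONS
print(round(average_estimate, 2))   # close to the actual value, 10.0
```

In any single repetition the estimate misses the actual value, but the average across repetitions does not systematically miss in either direction.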

If the distribution is symmetric, we can provide an interpretation that is perhaps even more intuitive. When the experiment is repeated many, many times, half the time the estimate is greater than the actual value and half the time it is less. Accordingly, applying the relative frequency interpretation of probability: in one repetition, the chances that the estimate will be greater than the actual value equal the chances that it will be less.

Variance (Spread) of the Estimate's Probability Distribution

Figure 14.2: Probability Distribution of an Estimate (Importance of Variance)

When the estimation procedure is unbiased, the variance (spread) of the distribution indicates the estimate's reliability, the likelihood that the numerical value of the estimate will be close to the actual value.

Correlated and Independent (Uncorrelated) Variables

Two variables are
correlated whenever the value of one variable does help us predict the value of the other;
independent (uncorrelated) whenever the value of one variable does not help us predict the value of the other.

Scatter Diagrams

Figure 14.3: Scatter Diagrams, Correlation, and Independence

The Dow Jones and Nasdaq growth rates are positively correlated: most of the scatter diagram points lie in the first and third quadrants. When the Dow Jones growth rate is high, the Nasdaq growth rate is usually high also. Similarly, when the Dow Jones growth rate is low, the Nasdaq growth rate is usually low also. Knowing one growth rate helps us predict the other. On the other hand, Amherst precipitation and the Nasdaq growth rate are independent (uncorrelated): the scatter diagram points are spread rather evenly across the graph. Knowing the Nasdaq growth rate does not help us predict Amherst precipitation, and vice versa.
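The patterns in Figure 14.3 can be checked numerically. A minimal sketch with made-up series standing in for the growth rates (the series, seed, and coefficients are assumptions, not the chapter's data):

```python
import math
import random

def correlation(x, y):
    """Sample correlation coefficient; always lies between -1 and +1."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

random.seed(0)
x = [random.gauss(0, 1) for _ in range(1000)]
noise = [random.gauss(0, 1) for _ in range(1000)]

pos = [xi + 0.5 * ni for xi, ni in zip(x, noise)]   # moves with x
neg = [-xi + 0.5 * ni for xi, ni in zip(x, noise)]  # moves against x

print(correlation(x, pos) > 0)    # positively correlated
print(correlation(x, neg) < 0)    # negatively correlated
```

Knowing x helps predict pos and neg; an unrelated series would produce a correlation coefficient near 0.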

Correlation Coefficient

The correlation coefficient indicates the degree to which two variables are correlated; it ranges from -1 to +1:
= 0: Independent (uncorrelated). Knowing the value of one variable does not help us predict the value of the other.
> 0: Positive correlation. Typically, when the value of one variable is high, the value of the other variable will be high.
< 0: Negative correlation. Typically, when the value of one variable is high, the value of the other variable will be low.

Omitted Explanatory Variables

We shall consider baseball attendance data to study the omitted variable phenomenon.

Project: Assess the determinants of baseball attendance.

Baseball Data: Panel data of baseball statistics for the 588 American League games played during the summer of 1996.

Attendance t    Paid attendance for game t
DateDay t    Day of game t
DateMonth t    Month of game t
DateYear t    Year of game t
DayOfWeek t    Day of the week for game t (Sunday=0, Monday=1, etc.)
DH t    Designated hitter for game t (1 if DH permitted; 0 otherwise)
HomeGamesBehind t    Games behind of the home team before game t
HomeIncome t    Per capita income in home team's city for game t
HomeLosses t    Season losses of the home team before game t
HomeNetWins t    Net wins (wins less losses) of the home team before game t
HomeSalary t    Player salaries of the home team for game t (millions of dollars)
HomeWins t    Season wins of the home team before game t
PriceTicket t    Average price of tickets sold for game t's home team (dollars)
VisitGamesBehind t    Games behind of the visiting team before game t
VisitLosses t    Season losses of the visiting team before game t

VisitNetWins t    Net wins (wins less losses) of the visiting team before game t
VisitSalary t    Player salaries of the visiting team for game t (millions of dollars)
VisitWins t    Season wins of the visiting team before game t

A Puzzle: Baseball Attendance

Let us begin our analysis by focusing on the price of tickets. Consider the following two models that attempt to explain game attendance:

Model 1: Attendance depends on ticket price only. The first model has a single explanatory variable, ticket price, PriceTicket:
Attendance t = β Const + β Price PriceTicket t + e t
Downward Sloping Demand Theory: This model is based on the economist's downward sloping demand theory. An increase in the price of a good decreases the quantity demanded. Higher ticket prices should reduce attendance; hence, the PriceTicket coefficient should be negative: β Price < 0
We shall use the ordinary least squares (OLS) estimation procedure to estimate the model's parameters:
[Link to MIT-ALSummer-1996.wf1 goes here.]

Ordinary Least Squares (OLS)
Dependent Variable: Attendance
Explanatory Variable(s):    Estimate    SE    t-Statistic    Prob
PriceTicket
Const
Number of Observations: 585
Estimated Equation: EstAttendance = 3,… + 1,897PriceTicket
Interpretation of Estimates: b PriceTicket = 1,897. We estimate that a $1.00 increase in the price of tickets increases attendance by 1,897 per game.
Table 14.1: Baseball Attendance Regression Results, Ticket Price Only

The estimated coefficient for the ticket price is positive, suggesting that higher prices lead to an increase in quantity demanded. This contradicts the downward sloping demand theory, does it not?

Model 2: Attendance depends on ticket price and salary of home team. In the second model, we include not only the price of tickets, PriceTicket, as an explanatory variable, but also the salary of the home team, HomeSalary:
Attendance t = β Const + β Price PriceTicket t + β HomeSalary HomeSalary t + e t
We can justify the salary explanatory variable on the grounds that fans like to watch good players. We shall call this the star theory. Presumably, a high salary team has better players, more stars, on its roster and accordingly will draw more fans.
Star Theory: Teams with higher salaries will have better players, which will increase attendance. The HomeSalary coefficient should be positive: β HomeSalary > 0
Now, use the ordinary least squares (OLS) estimation procedure to estimate the parameters.

Ordinary Least Squares (OLS)
Dependent Variable: Attendance
Explanatory Variable(s):    Estimate    SE    t-Statistic    Prob
PriceTicket
HomeSalary
Const
Number of Observations: 585
Estimated Equation: EstAttendance = 9,… - 591PriceTicket + 783HomeSalary
Interpretation of Estimates: b PriceTicket = -591. We estimate that a $1.00 increase in the price of tickets decreases attendance by 591 per game. b HomeSalary = 783. We estimate that a $1 million increase in the home team salary increases attendance by 783 per game.
Table 14.2: Baseball Attendance Regression Results, Ticket Price and Home Team Salary

These coefficient estimates lend support to our theories. The two models produce very different results concerning the effect of the ticket price on attendance. More specifically, the coefficient estimate for ticket price changes drastically from 1,897 to -591 when we add home team salary as an explanatory variable. This is a disquieting puzzle. We shall resolve this puzzle by reviewing the goal of multiple regression analysis and then explaining when omitting an explanatory variable will prevent us from achieving the goal.

Goal of Multiple Regression Analysis

Multiple regression analysis attempts to sort out the individual effect of each explanatory variable. The estimate of an explanatory variable's coefficient allows us to assess the effect that an individual explanatory variable itself has on the dependent variable. An explanatory variable's coefficient estimate estimates the change in the dependent variable resulting from a change in that particular explanatory variable while all other explanatory variables remain constant. In Model 1 we estimate that a $1.00 increase in the ticket price increases attendance by nearly 2,000 per game, whereas in Model 2 we estimate that a $1.00 increase decreases attendance by about 600 per game. The two models suggest that the individual effect of the ticket price is very different. The omitted variable phenomenon allows us to resolve this puzzle.

Omitted Explanatory Variables and Bias

Claim: Omitting an explanatory variable from a regression will bias the estimation procedure whenever two conditions are met. Bias results if the omitted explanatory variable
influences the dependent variable;
is correlated with an included explanatory variable.
When these two conditions are met, the coefficient estimate of the included explanatory variable is a composite of two effects:
the influence that the included explanatory variable itself has on the dependent variable (direct effect);
the influence that the omitted explanatory variable has on the dependent variable, because the included explanatory variable also acts as a proxy for the omitted explanatory variable (proxy effect).
Since the goal of multiple regression analysis is to sort out the individual effect of each explanatory variable, we want to capture only the direct effect.
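The direct effect and the proxy effect can be written compactly. For a model with two explanatory variables, y t = β Const + β x1 x1 t + β x2 x2 t + e t, the standard large-sample omitted variable decomposition (a textbook identity, not derived in this chapter) describes what the ordinary least squares (OLS) slope estimate on x1 t converges to when x2 t is omitted:

```latex
\operatorname{E}[b_{x1}]
\;\approx\;
\underbrace{\beta_{x1}}_{\text{direct effect}}
\;+\;
\underbrace{\beta_{x2}\,
\frac{\operatorname{Cov}(x_{1t},\,x_{2t})}{\operatorname{Var}(x_{1t})}}_{\text{proxy effect}}
```

The proxy term vanishes when β x2 equals 0 or when the covariance equals 0, which is exactly the case analysis carried out below.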

Econometrics Lab 14.1: Omitted Variable Proxy Effect

We can now use the Econometrics Lab to justify our claims concerning omitted explanatory variables. The following regression model, which includes two explanatory variables, is used:
Model: y t = β Const + β x1 x1 t + β x2 x2 t + e t
[Link to MIT-Lab 14.1 goes here.]

Figure 14.4: Omitted Variable Simulation

The simulation provides us with two options: we can include either both explanatory variables in the regression, Both Xs, or just one, Only X1. By default the Only X1 option is selected; consequently, the second explanatory variable is omitted. That is, x1 t is the included explanatory variable and x2 t is the omitted explanatory variable. For simplicity, assume that x1's coefficient, β x1, is positive. We shall consider three cases to illustrate when bias does and does not result:

Case 1: The coefficient of the omitted explanatory variable is positive and the two explanatory variables are independent (uncorrelated).
Case 2: The coefficient of the omitted explanatory variable equals zero and the two explanatory variables are positively correlated.
Case 3: The coefficient of the omitted explanatory variable is positive and the two explanatory variables are positively correlated.
We shall now show that only in the last case does bias result, because only in the last case is the proxy effect present.

Case 1: The coefficient of the omitted explanatory variable is positive and the two explanatory variables are independent (uncorrelated).
Will bias result in this case? Since the two explanatory variables are independent (uncorrelated), an increase in the included explanatory variable, x1 t, typically will not affect the omitted explanatory variable, x2 t. Consequently, the included explanatory variable, x1 t, will not act as a proxy for the omitted explanatory variable, x2 t. Bias should not result.

Included variable x1 t up → (Independence) → Typically, omitted variable x2 t unaffected
β x1 > 0: y t up (Direct Effect)        β x2 > 0: y t unaffected (No Proxy Effect)

We shall use our lab to confirm this logic. By default, the actual coefficient for the included explanatory variable, x1 t, equals 2 and the actual coefficient for the omitted explanatory variable, x2 t, is nonzero; it equals 5. Their correlation coefficient, Corr X1&X2, equals .00; hence, the two explanatory variables are independent (uncorrelated). Be certain that the Pause checkbox is cleared. Click Start and, after many, many repetitions, click Stop. Table 14.3 reports that the average value of the coefficient estimates for the included explanatory variable equals its actual value: both equal 2.0. The ordinary least squares (OLS) estimation procedure is unbiased.

Actual    Actual    Corr    Mean (Average)        Percent of Coef1 Estimates
Coef 1    Coef 2    Coef    of Coef1 Estimates    Below Actual Value    Above Actual Value
Table 14.3: Omitted Variables Simulation Results

The ordinary least squares (OLS) estimation procedure captures the individual influence that the included explanatory variable itself has on the dependent variable. This is precisely the effect that we wish to capture. The ordinary least squares (OLS) estimation procedure is unbiased; it is doing what we want it to do.

Case 2: The coefficient of the omitted explanatory variable equals zero and the two explanatory variables are positively correlated.
In the second case, the two explanatory variables are positively correlated; when the included explanatory variable, x1 t, increases, the omitted explanatory variable, x2 t, will typically increase also. But the actual coefficient of the omitted explanatory variable, β x2, equals 0; hence, the dependent variable, y t, is unaffected by the increase in x2 t. There is no proxy effect because the omitted variable, x2 t, does not affect the dependent variable; hence, bias should not result.

Included variable x1 t up → (Positive Correlation) → Typically, omitted variable x2 t up
β x1 > 0: y t up (Direct Effect)        β x2 = 0: y t unaffected (No Proxy Effect)

To confirm our logic with the simulation, be certain that the actual coefficient for the omitted explanatory variable equals 0 and the correlation coefficient equals .30. Click Start and then, after many, many repetitions, click Stop. Table 14.4 reports that the average value of the coefficient estimates for the included explanatory variable equals its actual value: both equal 2.0. The ordinary least squares (OLS) estimation procedure is unbiased.

Actual    Actual    Corr    Mean (Average)        Percent of Coef1 Estimates
Coef 1    Coef 2    Coef    of Coef1 Estimates    Below Actual Value    Above Actual Value
Table 14.4: Omitted Variables Simulation Results

Again, the ordinary least squares (OLS) estimation procedure captures the influence that the included explanatory variable itself has on the dependent variable. Again, there is no proxy effect and all is well.

Case 3: The coefficient of the omitted explanatory variable is positive and the two explanatory variables are positively correlated.
As with Case 2, the two explanatory variables are positively correlated; when the included explanatory variable, x1 t, increases, the omitted explanatory variable, x2 t, will typically increase also. But now the actual coefficient of the omitted explanatory variable, β x2, is no longer 0; it is positive. Hence, an increase in the omitted explanatory variable, x2 t, increases the dependent variable. In addition to having a direct effect on the dependent variable, the included explanatory variable, x1 t, also acts as a proxy for the omitted explanatory variable, x2 t. There is a proxy effect.

Included variable x1 t up → (Positive Correlation) → Typically, omitted variable x2 t up
β x1 > 0: y t up (Direct Effect)        β x2 > 0: y t up (Proxy Effect)

In the simulation, the actual coefficient of the omitted explanatory variable, β x2, once again equals 5. The two explanatory variables are still positively correlated; the correlation coefficient equals .30. Click Start and then, after many, many repetitions, click Stop. Table 14.5 reports that the average value of the coefficient estimates for the included explanatory variable, 3.5, exceeds its actual value, 2.0. The ordinary least squares (OLS) estimation procedure is biased upward.

Actual    Actual    Corr    Mean (Average)        Percent of Coef1 Estimates
Coef 1    Coef 2    Coef    of Coef1 Estimates    Below Actual Value    Above Actual Value
Table 14.5: Omitted Variables Simulation Results

Now we have a problem. The ordinary least squares (OLS) estimation procedure overstates the influence of the included explanatory variable, the effect that the included explanatory variable itself has on the dependent variable.
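For readers without access to the interactive lab, the three cases can be reproduced with a short Monte Carlo sketch. The actual coefficients (2 and 5) and the correlation coefficients (.00 and .30) come from the text; the sample size, error variance, and number of repetitions are assumptions:

```python
import numpy as np

rng = np.random.default_rng(14)
REPS, N, BETA_X1 = 2000, 50, 2.0

def mean_b1(beta_x2, rho):
    """Average slope estimate on x1 across repetitions when x2 is omitted."""
    slopes = []
    for _ in range(REPS):
        x1 = rng.normal(0, 1, N)
        # Construct x2 so that Corr(x1, x2) = rho, with unit variance.
        x2 = rho * x1 + np.sqrt(1 - rho**2) * rng.normal(0, 1, N)
        y = 10 + BETA_X1 * x1 + beta_x2 * x2 + rng.normal(0, 1, N)
        X = np.column_stack([np.ones(N), x1])            # x2 omitted!
        slopes.append(np.linalg.lstsq(X, y, rcond=None)[0][1])
    return np.mean(slopes)

print(round(mean_b1(5.0, 0.00), 1))  # Case 1: near 2.0, unbiased
print(round(mean_b1(0.0, 0.30), 1))  # Case 2: near 2.0, unbiased
print(round(mean_b1(5.0, 0.30), 1))  # Case 3: near 3.5, biased upward
```

Under these settings (unit variances), the average slope settles near β x1 + β x2 × Corr, which is 2.0, 2.0, and 3.5 for the three cases, matching the pattern the text reports for Tables 14.3 through 14.5.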

Let us now take a brief aside. Case 3 provides us with the opportunity to illustrate what bias does and does not mean.

Figure 14.5: Probability Distribution of an Estimate (Upward Bias)

What bias does mean: Bias means that the estimation procedure systematically overestimates or underestimates the actual value. In this case, upward bias is present: after many, many repetitions, the average of the estimates is greater than the actual value.
What bias does not mean: Bias does not mean that the value of the estimate in a single repetition must be less than the actual value in the case of downward bias, or greater than the actual value in the case of upward bias. Focus on the last simulation. The ordinary least squares (OLS) estimation procedure is biased upward as a consequence of the proxy effect. Despite the upward bias, however, the estimate of the included explanatory variable is less than the actual value in 12.5 percent of the repetitions. Upward bias does not guarantee that in any one repetition the estimate will be greater than the actual value; it just means that it will be greater on average. If the probability distribution is symmetric, the chances of the estimate being greater than the actual value exceed the chances of its being less.

Now, we return to our three omitted variable cases by summarizing them:

Case    Does the omitted variable          Is the omitted variable                Estimation procedure for
        influence the dependent variable?  correlated with an included variable?  the included variable is
1       Yes                                No                                     Unbiased
2       No                                 Yes                                    Unbiased
3       Yes                                Yes                                    Biased
Table 14.6: Omitted Variables Simulation Summary

Econometrics Lab 14.2: Avoiding Omitted Variable Bias

Question: Is the estimation procedure biased or unbiased when both explanatory variables are included in the regression?
[Link to MIT-Lab 14.2 goes here.]
To address this question, Both Xs is now selected. This means that both explanatory variables, x1 t and x2 t, will be included in the regression. Both explanatory variables affect the dependent variable, and they are correlated. As we saw in Case 3, if one of the explanatory variables is omitted, bias will result. To see what occurs when both explanatory variables are included, click Start and, after many, many repetitions, click Stop. When both variables are included, the ordinary least squares (OLS) estimation procedure is unbiased:

Actual    Actual    Correlation    Mean of Coef 1
Coef 1    Coef 2    Parameter      Estimates
Table 14.7: Omitted Variables Simulation Results, No Omitted Variables

Conclusion: To avoid omitted variable bias, all relevant explanatory variables should be included in a regression.
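Econometrics Lab 14.2 can be sketched the same way: the data-generating process matches Case 3 (actual coefficients 2 and 5 and correlation coefficient .30, from the text; the other settings are assumptions), but now both explanatory variables are included:

```python
import numpy as np

rng = np.random.default_rng(142)
REPS, N = 2000, 50
BETA_X1, BETA_X2, RHO = 2.0, 5.0, 0.30

b1_estimates = []
for _ in range(REPS):
    x1 = rng.normal(0, 1, N)
    x2 = RHO * x1 + np.sqrt(1 - RHO**2) * rng.normal(0, 1, N)
    y = 10 + BETA_X1 * x1 + BETA_X2 * x2 + rng.normal(0, 1, N)
    X = np.column_stack([np.ones(N), x1, x2])    # both Xs included
    b1_estimates.append(np.linalg.lstsq(X, y, rcond=None)[0][1])

print(round(np.mean(b1_estimates), 1))   # near 2.0: no proxy effect, no bias
```

With x2 in the regression there is nothing for x1 to proxy for, so the average of the Coef 1 estimates returns to the actual value.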

Resolving the Baseball Attendance Puzzle

We begin by reviewing the baseball attendance models:

Model 1: Attendance depends on ticket price only.
Attendance t = β Const + β Price PriceTicket t + e t
Estimated Equation: EstAttendance = 3,… + 1,897PriceTicket
Interpretation: We estimate that a $1.00 increase in the price of tickets increases attendance by 1,897 per game.

Model 2: Attendance depends on ticket price and salary of home team.
Attendance t = β Const + β Price PriceTicket t + β HomeSalary HomeSalary t + e t
Estimated Equation: EstAttendance = 9,… - 591PriceTicket + 783HomeSalary
Interpretation: We estimate that
a $1.00 increase in the price of tickets decreases attendance by 591 per game;
a $1 million increase in the home team salary increases attendance by 783 per game.

The ticket price coefficient estimate is affected dramatically by the presence of home team salary; the Model 1 estimate is much higher: 1,897 versus -591. Why? We shall now argue that when ticket price is included in the regression and home team salary is omitted, as in Model 1, there is reason to believe that the estimation procedure for the ticket price coefficient will be biased. We just learned that omitted variable bias results when the following two conditions are met; that is, when an omitted explanatory variable
influences the dependent variable and
is correlated with an included explanatory variable.
Now focus on Model 1:
Attendance t = β Const + β Price PriceTicket t + e t
Model 1 omits home team salary, HomeSalary t. Are the two omitted variable bias conditions met? It certainly appears reasonable to believe that the omitted explanatory variable, HomeSalary t, affects the dependent variable, Attendance t. The club owner who is paying the high salaries certainly believes so; the owner hopes that by hiring better players, more fans will attend the games. Consequently, it appears that the first condition required for omitted variable bias is met. What about the second condition? We can check whether the omitted and included variables are correlated by using statistical software to calculate the correlation matrix:

Correlation Matrix
                PriceTicket    HomeSalary
PriceTicket     1.00           .78
HomeSalary      .78            1.00
Table 14.8: Ticket Price and Home Team Salary Correlation Matrix

The correlation coefficient between PriceTicket t and HomeSalary t is .78; the variables are positively correlated. The second condition required for omitted variable bias is also met. We have reason to suspect bias in Model 1. When the included variable, PriceTicket t, increases, the omitted variable, HomeSalary t, typically increases also. An increase in the omitted variable, HomeSalary t, increases the dependent variable, Attendance t:

Included variable PriceTicket t up → (Positive Correlation) → Typically, omitted variable HomeSalary t up
β Price < 0: Attendance t down (Direct Effect)        β HomeSalary > 0: Attendance t up (Proxy Effect)

In addition to having a direct effect on the dependent variable, the included explanatory variable, PriceTicket t, also acts as a proxy for the omitted explanatory variable, HomeSalary t. There is a proxy effect, and upward bias results. This provides us with an explanation of why the ticket price coefficient estimate in Model 1 is greater than the estimate in Model 2.
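The resolution can be mimicked with simulated data. Everything below is made up for illustration (the MIT workfile is not reproduced here): salary both raises attendance and is positively correlated with ticket price, so omitting salary drags the estimated price coefficient upward, here all the way past zero:

```python
import numpy as np

rng = np.random.default_rng(1996)
n = 5000

# Hypothetical data-generating process: high-payroll teams charge higher
# prices, and salary itself boosts attendance (the star theory).
salary = rng.uniform(10, 60, n)                     # millions of dollars
price = 5 + 0.15 * salary + rng.normal(0, 1, n)     # dollars, correlated with salary
attendance = 5000 - 600 * price + 800 * salary + rng.normal(0, 2000, n)

def ols(y, *xs):
    """OLS coefficients (constant first) via least squares."""
    X = np.column_stack([np.ones(len(y))] + list(xs))
    return np.linalg.lstsq(X, y, rcond=None)[0]

b_price_only = ols(attendance, price)[1]          # Model 1: salary omitted
b_price_both = ols(attendance, price, salary)[1]  # Model 2: salary included

print(b_price_only > 0)   # True: direct effect plus proxy effect
print(b_price_both < 0)   # True: direct price effect isolated
```

The price-only slope bundles the true negative price effect with the positive salary effect that price proxies for; including salary separates the two, just as Model 2 does in the text.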

Omitted Variable Summary

Omitting an explanatory variable from a regression biases the estimation procedure whenever two conditions are met. Bias results if the omitted explanatory variable
influences the dependent variable;
is correlated with an included explanatory variable.
When these two conditions are met, the coefficient estimate of the included explanatory variable is a composite of two effects; it reflects
the influence that the included explanatory variable itself has on the dependent variable (direct effect);
the influence that the omitted explanatory variable has on the dependent variable, because the included explanatory variable also acts as a proxy for the omitted explanatory variable (proxy effect).
The bad news is that the proxy effect leads to bias. The good news is that we can eliminate the proxy effect and its accompanying bias by including the omitted explanatory variable. But now we shall learn that if two explanatory variables are highly correlated, a different problem can emerge.

Multicollinearity

The phenomenon of multicollinearity occurs when two explanatory variables are highly correlated. Recall that multiple regression analysis attempts to sort out the influence of each individual explanatory variable. But what happens when we include two explanatory variables in a single regression that are perfectly correlated? Let us see.

Perfectly Correlated Explanatory Variables

In our baseball attendance workfile, ticket prices, PriceTicket t, are reported in terms of dollars. Generate a new variable, PriceCents t, reporting ticket prices in terms of cents rather than dollars:
PriceCents t = 100 PriceTicket t
Note that the variables PriceTicket t and PriceCents t are perfectly correlated. If we know one, we can predict the value of the other with complete accuracy. Just to confirm this, use statistical software to calculate the correlation matrix:

Correlation Matrix
                PriceTicket    PriceCents
PriceTicket     1.00           1.00
PriceCents      1.00           1.00
Table 14.9: EViews Dollar and Cent Ticket Price Correlation Matrix

The correlation coefficient of PriceTicket t and PriceCents t equals 1.00. The variables are indeed perfectly correlated. Now, run a regression with Attendance as the dependent variable and both PriceTicket and PriceCents as explanatory variables:
Dependent variable: Attendance
Explanatory variables: PriceTicket and PriceCents
Your statistical software will report a diagnostic. Different software packages provide different messages, but basically the software is telling us that it cannot run the regression. Why does this occur? The reason is that the two variables are perfectly correlated. Knowing the value of one allows us to predict the value of the other with complete accuracy. Both explanatory variables contain precisely the same information. Multiple regression analysis attempts to sort out the influence of each individual explanatory variable; but if both variables contain precisely the same information, it is impossible to do this. How can we possibly separate out each variable's individual effect when the two variables contain identical information? We are asking statistical software to do the impossible.

Explanatory variables perfectly correlated
→ Knowing the value of one explanatory variable allows us to predict perfectly the value of the other
→ Both variables contain precisely the same information
→ Impossible to separate out the individual effect of each variable

Next, we consider a case in which the explanatory variables are highly, although not perfectly, correlated.
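A sketch of why the software balks, using made-up prices: with PriceCents = 100 × PriceTicket, the design matrix is rank deficient, so the normal equations have no unique solution:

```python
import numpy as np

rng = np.random.default_rng(0)
price_ticket = rng.uniform(5.0, 25.0, 100)   # dollars (made up for illustration)
price_cents = 100.0 * price_ticket           # cents: identical information

# The two variables are perfectly correlated...
corr = np.corrcoef(price_ticket, price_cents)[0, 1]
print(round(corr, 6))                        # 1.0

# ...so the design matrix [Const, PriceTicket, PriceCents] has only two
# linearly independent columns. X'X is singular, which is exactly what
# triggers the software's diagnostic.
X = np.column_stack([np.ones(100), price_ticket, price_cents])
rank = np.linalg.matrix_rank(X)
print(rank)                                  # 2, not 3
```

Any attempt to attribute the price effect between the two columns is arbitrary: for any candidate pair of coefficients (b1, b2), the pair (b1 + 100c, b2 - c) fits equally well, so no individual effect can be separated out.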

Highly Correlated Explanatory Variables

To investigate the problems created by highly correlated explanatory variables we shall use our baseball data to investigate a model that includes four explanatory variables:

Attendance_t = β_Const + β_Price PriceTicket_t + β_HomeSalary HomeSalary_t + β_HomeNW HomeNetWins_t + β_HomeGB HomeGamesBehind_t + e_t

Attendance_t: Paid attendance for game t
PriceTicket_t: Average price of tickets sold for game t's home team (dollars)
HomeSalary_t: Player salaries of the home team for game t (millions of dollars)
HomeNetWins_t: The difference between the number of wins and losses of the home team before game t
HomeGamesBehind_t: Games behind of the home team before game t

The variable HomeNetWins_t equals the difference between the number of wins and losses of the home team. It attempts to capture the quality of the team. HomeNetWins_t will be positive and large for a high quality team, a team that wins many more games than it loses. On the other hand, HomeNetWins_t will be a negative number for a low quality team. Since baseball fans enjoy watching high quality teams, we would expect high quality teams to be rewarded with greater attendance.

The variable HomeGamesBehind_t captures the home team's standing in its divisional race. For those who are not baseball fans, note that all teams that win their division automatically qualify for the baseball playoffs. Ultimately, the two teams that win the American and National League playoffs meet in the World Series. Since it is the goal of every team to win the World Series, each team strives to win its division. Games behind indicates how close a team is to winning its division. To explain how games behind are calculated, consider the final standings of the American League Eastern Division in 2009:

Team                 Wins   Losses   Home Net Wins   Games Behind
New York Yankees      103       59              44              0
Boston Red Sox         95       67              28              8
Tampa Bay Rays         84       78               6             19
Toronto Blue Jays      75       87             -12             28
Baltimore Orioles      64       98             -34             39

Table 14.10: 2009 Final Season Standings AL East

The Yankees had the best record; the games behind value for the Yankees equals 0. The Red Sox won eight fewer games than the Yankees; hence, the Red Sox were 8 games behind. The Rays won 19 fewer games than the Yankees; hence the Rays were 19 games behind. Similarly, the Blue Jays were 28 games behind and the Orioles 39 games behind.¹ More generally, a team's games behind equals the average of its wins deficit and its losses surplus relative to the division leader; in final standings, where every team has played the same number of games, this is simply the difference in wins.

During the season, if a team's games behind becomes larger, it becomes less likely that the team will win its division, less likely that the team will qualify for the playoffs, and less likely that the team will eventually win the World Series. Consequently, if a team's games behind becomes larger, we would expect home team fans to become discouraged, resulting in lower attendance. We use the terms team quality and division race to summarize our theories regarding home net wins and home team games behind:

Team Quality Theory: More net wins increase attendance. β_HomeNW > 0.
Division Race Theory: More games behind decreases attendance. β_HomeGB < 0.

We would expect HomeNetWins_t and HomeGamesBehind_t to be negatively correlated. As HomeNetWins_t decreases, a team moves farther from the top of its division and consequently HomeGamesBehind_t increases. We would expect the correlation coefficient for HomeNetWins_t and HomeGamesBehind_t to be negative. Let us check by computing their correlation matrix:

[Link to MIT-ALSummer-1996.wf1 goes here.]

Correlation Matrix
                    HomeNetWins   HomeGamesBehind
HomeNetWins               1.000             -.962
HomeGamesBehind           -.962             1.000

Table 14.11: HomeNetWins and HomeGamesBehind Correlation Matrix

Table 14.11 reports that the correlation coefficient for HomeGamesBehind_t and HomeNetWins_t equals −.962. Recall that the correlation coefficient must lie between −1 and +1. When two variables are perfectly negatively correlated their correlation coefficient equals −1. While HomeGamesBehind_t and HomeNetWins_t are not perfectly negatively correlated, they come close; they are highly negatively correlated.
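The games-behind arithmetic, and the reason the two variables track each other so closely, can be sketched in a few lines using the Table 14.10 records:

```python
import numpy as np

# Games behind relative to the division leader, using the standard formula:
# GB = ((leader wins - team wins) + (team losses - leader losses)) / 2.
records = {                       # 2009 AL East final records (wins, losses)
    "Yankees":   (103, 59),
    "Red Sox":   (95, 67),
    "Rays":      (84, 78),
    "Blue Jays": (75, 87),
    "Orioles":   (64, 98),
}
lw, ll = records["Yankees"]       # the division leader's record
games_behind = {t: ((lw - w) + (l - ll)) / 2 for t, (w, l) in records.items()}
print(games_behind)               # 0, 8, 19, 28, 39; matching Table 14.10

# In final standings, every team has played the same number of games, so
# games behind is an exact linear function of net wins and the correlation
# is exactly -1. In-season standings, where teams have played different
# numbers of games, are only approximately collinear, which is one reason
# the chapter's in-season data yields -.962 rather than -1.
net_wins = np.array([w - l for w, l in records.values()])
gb = np.array(list(games_behind.values()))
print(np.corrcoef(net_wins, gb)[0, 1])
```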

We use the ordinary least squares (OLS) estimation procedure to estimate the model's parameters:

Ordinary Least Squares (OLS)
Dependent Variable: Attendance
Explanatory Variable(s):    Estimate    SE    t-statistic    Prob
PriceTicket
HomeSalary
HomeNetWins
HomeGamesBehind
Const
Number of Observations: 585

Estimated Equation: EstAttendance = 11,… − 437PriceTicket + 668HomeSalary + 61HomeNetWins − 84HomeGamesBehind

Interpretation of Estimates:
b_PriceTicket = −437. We estimate that a $1.00 increase in the price of tickets decreases attendance by 437 per game.
b_HomeSalary = 668. We estimate that a $1 million increase in the home team salary increases attendance by 668 per game.
b_HomeNetWins = 61. We estimate that 1 additional home net win increases attendance by 61 per game.
b_HomeGamesBehind = −84. We estimate that 1 additional game behind decreases attendance by 84 per game.

Table 14.12: Attendance Regression Results

The sign of each estimate supports the theories. Focus on the two new variables included in the model: HomeNetWins_t and HomeGamesBehind_t. Construct the null and alternative hypotheses.

Team Quality Theory
H0: β_HomeNW = 0   Team quality has no effect on attendance
H1: β_HomeNW > 0   Team quality increases attendance

Division Race Theory
H0: β_HomeGB = 0   Games behind has no effect on attendance
H1: β_HomeGB < 0   Games behind decreases attendance

While the signs of the coefficient estimates are encouraging, some of the results are disappointing:

The coefficient estimate for HomeNetWins_t is positive, supporting our theory, but what about the Prob[Results IF H0 True]? What is the probability that the estimate from one regression would equal 61 or more, if H0 were true (that is, if the actual coefficient, β_HomeNW, equals 0, if home team quality has no effect on attendance)? Using the tails probability:

Prob[Results IF H0 True] = .4778/2 ≈ .24

We cannot reject the null hypothesis at the traditional significance levels of 1, 5, or 10 percent, suggesting that it is quite possible for the null hypothesis to be true, quite possible that home team quality has no effect on attendance.

Similarly, the coefficient estimate for HomeGamesBehind_t is negative, supporting our theory, but what about the Prob[Results IF H0 True]? What is the probability that the estimate from one regression would equal −84 or less, if H0 were true (that is, if the actual coefficient, β_HomeGB, equals 0, if games behind has no effect on attendance)? Using the tails probability:

Prob[Results IF H0 True] = .6138/2 ≈ .31

Again, we cannot reject the null hypothesis at the traditional significance levels of 1, 5, or 10 percent, suggesting that it is quite possible for the null hypothesis to be true, quite possible that games behind has no effect on attendance.

Should we abandon our theories as a consequence of these regression results? Let us perform a Wald test to assess the proposition that both coefficients equal 0:

H0: β_HomeNW = 0 and β_HomeGB = 0       Neither team quality nor games behind has an effect on attendance
H1: β_HomeNW ≠ 0 and/or β_HomeGB ≠ 0    Either team quality and/or games behind has an effect on attendance

Wald Test
                      Degrees of Freedom
              Value     Num      Dem      Prob
F-statistic               2      580     .0067

Table 14.13: EViews Wald Test Results

Prob[Results IF H0 True]: What is the probability that the F-statistic would be as large as the value reported or larger, if H0 were true (that is, if both β_HomeNW and β_HomeGB equal 0, if both team quality and games behind have no effect on attendance)?

Prob[Results IF H0 True] = .0067

We can reject the null hypothesis at a 1 percent significance level; it is unlikely that both team quality and games behind have no effect on attendance. There appears to be a paradox when we compare the t-tests and the Wald test:

t-tests: We cannot reject the null hypothesis that team quality has no effect on attendance, and we cannot reject the null hypothesis that games behind has no effect on attendance. Individually, neither team quality nor games behind appears to influence attendance.

Wald test: We can reject the null hypothesis that both team quality and games behind have no effect on attendance. Team quality and/or games behind do appear to influence attendance.

Individually, neither team quality nor games behind appears to influence attendance significantly; but taken together, by asking whether team quality and/or games behind influence attendance, we conclude that they do.
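The mechanics behind the Wald (F) test can be reproduced on simulated data; the sketch below (the data-generating numbers are invented stand-ins, not the baseball workfile) compares the restricted and unrestricted sums of squared residuals:

```python
import numpy as np
from scipy import stats

# Invented stand-in data: two highly correlated regressors, both of which
# truly influence y, mimicking HomeNetWins and HomeGamesBehind.
rng = np.random.default_rng(1)
n = 585
net_wins = rng.normal(0.0, 10.0, n)
games_behind = -0.4 * net_wins + rng.normal(0.0, 1.0, n)
y = 20000 + 100 * net_wins - 150 * games_behind + rng.normal(0.0, 5000.0, n)

def ssr(y, X):
    """Sum of squared residuals from an OLS fit of y on X."""
    b = np.linalg.lstsq(X, y, rcond=None)[0]
    e = y - X @ b
    return e @ e

ones = np.ones(n)
X_u = np.column_stack([ones, net_wins, games_behind])  # unrestricted model
X_r = ones[:, None]                                    # H0: both slopes are 0

q, k = 2, X_u.shape[1]                                 # restrictions; parameters
F = ((ssr(y, X_r) - ssr(y, X_u)) / q) / (ssr(y, X_u) / (n - k))
p = stats.f.sf(F, q, n - k)                            # Prob[Results IF H0 True]
print(F, p)
```

Even when each slope's individual t-test is weak because of the collinearity, the joint F-statistic can reject the null decisively, matching the pattern in Tables 14.12 and 14.13.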

Next, let us run two regressions, each of which includes only one of the two troublesome explanatory variables:

Ordinary Least Squares (OLS)
Dependent Variable: Attendance
Explanatory Variable(s):    Estimate    SE    t-statistic    Prob
PriceTicket
HomeSalary
HomeNetWins
Const
Number of Observations: 585

Estimated Equation: EstAttendance = 11,… − 449PriceTicket + 672HomeSalary + 100HomeNetWins

Interpretation of Estimates:
b_PriceTicket = −449. We estimate that a $1.00 increase in the price of tickets decreases attendance by 449 per game.
b_HomeSalary = 672. We estimate that a $1 million increase in the home team salary increases attendance by 672 per game.
b_HomeNetWins = 100. We estimate that 1 additional home net win increases attendance by 100 per game.

Table 14.14: EViews Attendance Regression Results (HomeGamesBehind Omitted)

Ordinary Least Squares (OLS)
Dependent Variable: Attendance
Explanatory Variable(s):    Estimate    SE    t-statistic    Prob
PriceTicket
HomeSalary
HomeGamesBehind
Const
Number of Observations: 585

Estimated Equation: EstAttendance = 12,… − 433PriceTicket + 671HomeSalary − 194HomeGamesBehind

Interpretation of Estimates:
b_PriceTicket = −433. We estimate that a $1.00 increase in the price of tickets decreases attendance by 433 per game.
b_HomeSalary = 671. We estimate that a $1 million increase in the home team salary increases attendance by 671 per game.
b_HomeGamesBehind = −194. We estimate that 1 additional game behind decreases attendance by 194 per game.

Table 14.15: EViews Attendance Regression Results (HomeNetWins Omitted)

When only one of the two troublesome explanatory variables is included, its coefficient is significant.

Earmarks of Multicollinearity

We are observing what we shall call the earmarks of multicollinearity:

Explanatory variables are highly correlated.
Regression with both explanatory variables:
o t-tests do not allow us to reject the null hypothesis that the coefficient of each individual variable equals 0; when considering each explanatory variable individually, we cannot reject the hypothesis that each individually has no influence.
o a Wald test allows us to reject the null hypothesis that the coefficients of both explanatory variables equal 0; when considering both explanatory variables together, we can reject the hypothesis that they have no influence.
Regressions with only one explanatory variable appear to produce good results.

How can we explain this? Recall that multiple regression analysis attempts to sort out the influence of each individual explanatory variable. When two explanatory variables are perfectly correlated, it is impossible for the ordinary least squares (OLS) estimation procedure to separate out the individual influences of each variable. Consequently, if two variables are highly correlated, as team quality and games behind are, it may be very difficult for the ordinary least squares (OLS) estimation procedure to separate out the individual influence of each explanatory variable. This difficulty evidences itself in the variances of the coefficient estimates' probability distributions. When two highly correlated variables are included in the same regression, the variance of each estimate's probability distribution is large. This explains our t-test results.

Explanatory variables perfectly correlated:
→ Knowing the value of one variable allows us to predict the other perfectly
→ Both variables contain the same information
→ Impossible to separate out their individual effects

Explanatory variables highly correlated:
→ Knowing the value of one variable allows us to predict the other very accurately
→ In some sense, both variables contain nearly the same information
→ Difficult to separate out their individual effects
→ Large variance of each coefficient estimate's probability distribution

We use a simulation to justify our explanation.

Econometrics Lab 14.3: Multicollinearity

Figure 14.6: Multicollinearity Simulation

Our model includes two explanatory variables, x1_t and x2_t:

Model: y_t = β_Const + β_x1 x1_t + β_x2 x2_t + e_t

[Link to MIT-Lab 14.3 goes here.]

By default, the actual value of the coefficient for the first explanatory variable equals 2 and the actual value for the second equals 5. Note that the Both Xs option is selected; both explanatory variables are included in the regression. Initially, the correlation coefficient is specified as .00; that is, initially the explanatory variables are independent. Be certain that the Pause checkbox is cleared and click Start. After many, many repetitions click Stop. Next, repeat this process for a correlation coefficient of .30, a correlation coefficient of .60, and a correlation coefficient of .90.
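The lab's repetitions can be sketched directly; here is a minimal Monte Carlo in numpy (the sample size, error variance, and repetition count are invented) with the actual coefficients set to 2 and 5 as in the lab:

```python
import numpy as np

rng = np.random.default_rng(0)
n, reps = 30, 2000                      # invented sample size and repetitions

def simulate(rho):
    """Mean and variance of the Coef 1 estimates across many repetitions."""
    b1 = np.empty(reps)
    for r in range(reps):
        x1 = rng.normal(0.0, 1.0, n)
        # Build x2 with the requested correlation with x1.
        x2 = rho * x1 + np.sqrt(1.0 - rho**2) * rng.normal(0.0, 1.0, n)
        y = 10.0 + 2.0 * x1 + 5.0 * x2 + rng.normal(0.0, 2.0, n)
        X = np.column_stack([np.ones(n), x1, x2])
        b1[r] = np.linalg.lstsq(X, y, rcond=None)[0][1]
    return b1.mean(), b1.var()

# The mean stays near the actual value 2 (unbiased) at every correlation,
# while the variance of the estimates grows as rho rises.
for rho in (0.0, 0.3, 0.6, 0.9):
    print(rho, simulate(rho))
```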

Correlation Parameter    Actual Coef 1    Mean of Coef 1 Estimates    Variance of Coef 1 Estimates

Table 14.16: Multicollinearity Simulation Results

The simulation reveals both good news and bad news:

Good news: The ordinary least squares (OLS) estimation procedure is unbiased. The mean of the estimate's probability distribution equals the actual value. The estimation procedure does not systematically underestimate or overestimate the actual value.

Bad news: As the two explanatory variables become more correlated, the variance of the coefficient estimate's probability distribution increases. Consequently, the estimate from one repetition becomes less reliable.

The simulation illustrates the phenomenon of multicollinearity.

Irrelevant Explanatory Variables

An irrelevant explanatory variable is a variable that does not influence the dependent variable. Including an irrelevant explanatory variable can be viewed as adding noise, an additional element of uncertainty, into the mix. An irrelevant explanatory variable adds a new random influence to the model. If our logic is correct, irrelevant explanatory variables should lead to both good news and bad news:

Good news: Random influences do not cause the ordinary least squares (OLS) estimation procedure to be biased. Consequently, the inclusion of an irrelevant explanatory variable does not lead to bias.

Bad news: The additional uncertainty added by the new random influence means that the coefficient estimate is less reliable; the variance of the coefficient estimate's probability distribution rises when an irrelevant explanatory variable is present.

We shall use our Econometrics Lab to justify our intuition.

Econometrics Lab 14.4: Irrelevant Explanatory Variables

Figure 14.7: Irrelevant Explanatory Variable Simulation

[Link to MIT-Lab 14.4 goes here.]

Once again we use a two explanatory variable model:

Model: y_t = β_Const + β_x1 x1_t + β_x2 x2_t + e_t

By default, the first explanatory variable, x1_t, is the relevant explanatory variable; the default value of its coefficient is 2. The second explanatory variable, x2_t, is the irrelevant one. An irrelevant explanatory variable has no effect on the dependent variable; consequently, the actual value of its coefficient, β_x2, equals 0. Initially, the Only X1 option is selected, indicating that only the relevant explanatory variable, x1_t, is included in the regression; the irrelevant explanatory variable, x2_t, is not included. Click Start and then, after many, many repetitions, click Stop. Since the irrelevant explanatory variable is not included in the regression, correlation between the two explanatory variables should have no impact on the results. Confirm this by changing the correlation coefficient from .00 to .30 in the Corr X1&X2 list. Click Start and then, after many, many repetitions,

click Stop. Similarly, show that the results are unaffected when the correlation coefficient is .60 and .90. Subsequently, investigate what happens when the irrelevant explanatory variable is included by selecting the Both Xs option; the irrelevant explanatory variable, x2_t, will now be included in the regression. Be certain that the correlation coefficient for the relevant and irrelevant explanatory variables initially equals .00. Click Start and then, after many, many repetitions, click Stop. Investigate how correlation between the two explanatory variables affects the results when the irrelevant explanatory variable is included by selecting correlation coefficient values of .30, .60, and .90. For each case, click Start and then, after many, many repetitions, click Stop. Table 14.17 reports the results of the lab.

                            Only Variable 1 Included        Variables 1 and 2 Included
Corr Coef       Actual      Mean of        Variance of      Mean of        Variance of
for Variables   Coef 1      Coef 1         Coef 1           Coef 1         Coef 1
1 and 2                     Estimates      Estimates        Estimates      Estimates

Table 14.17: Irrelevant Explanatory Variable Simulation Results

The results reported in Table 14.17 are not surprising; they support our intuition:

Only Relevant Variable (Variable 1) Included:
o The mean of the coefficient estimates for the relevant explanatory variable, x1_t, equals 2, the actual value; consequently, the ordinary least squares (OLS) estimation procedure for the coefficient estimate is unbiased.
o Naturally, the variance of the coefficient estimates is not affected by correlation between the relevant and irrelevant explanatory variables because the irrelevant explanatory variable is not included in the regression.

Both Relevant and Irrelevant Variables (Variables 1 and 2) Included:
o The mean of the coefficient estimates for the relevant explanatory variable, x1_t, still equals 2, the actual value; consequently, the ordinary least squares (OLS) estimation procedure for the coefficient estimate is unbiased.
o As the correlation between the two explanatory variables increases, however, the variance of the coefficient estimates rises; the estimates become less reliable when the irrelevant explanatory variable is included.
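The lab's comparison can likewise be sketched in numpy (the sample size, error variance, and repetition count are invented); x2 is the irrelevant variable, its actual coefficient being 0:

```python
import numpy as np

rng = np.random.default_rng(0)
n, reps = 30, 2000                      # invented sample size and repetitions

def simulate(rho, include_x2):
    """Mean and variance of the Coef 1 estimates across many repetitions."""
    b1 = np.empty(reps)
    for r in range(reps):
        x1 = rng.normal(0.0, 1.0, n)
        x2 = rho * x1 + np.sqrt(1.0 - rho**2) * rng.normal(0.0, 1.0, n)
        y = 10.0 + 2.0 * x1 + rng.normal(0.0, 2.0, n)   # beta_x2 = 0
        cols = [np.ones(n), x1] + ([x2] if include_x2 else [])
        b1[r] = np.linalg.lstsq(np.column_stack(cols), y, rcond=None)[0][1]
    return b1.mean(), b1.var()

# Excluding the irrelevant x2: unbiased, and rho does not matter.
# Including it: still unbiased, but the variance rises as rho grows.
for include in (False, True):
    for rho in (0.0, 0.9):
        print(include, rho, simulate(rho, include))
```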


28. SIMPLE LINEAR REGRESSION III 28. SIMPLE LINEAR REGRESSION III Fitted Values and Residuals To each observed x i, there corresponds a y-value on the fitted line, y = βˆ + βˆ x. The are called fitted values. ŷ i They are the values of

More information

Chapter 3 Multiple Regression Complete Example

Chapter 3 Multiple Regression Complete Example Department of Quantitative Methods & Information Systems ECON 504 Chapter 3 Multiple Regression Complete Example Spring 2013 Dr. Mohammad Zainal Review Goals After completing this lecture, you should be

More information

Mgmt 469. Causality and Identification

Mgmt 469. Causality and Identification Mgmt 469 Causality and Identification As you have learned by now, a key issue in empirical research is identifying the direction of causality in the relationship between two variables. This problem often

More information

CHAPTER 6: SPECIFICATION VARIABLES

CHAPTER 6: SPECIFICATION VARIABLES Recall, we had the following six assumptions required for the Gauss-Markov Theorem: 1. The regression model is linear, correctly specified, and has an additive error term. 2. The error term has a zero

More information

Chapter 16. Simple Linear Regression and dcorrelation

Chapter 16. Simple Linear Regression and dcorrelation Chapter 16 Simple Linear Regression and dcorrelation 16.1 Regression Analysis Our problem objective is to analyze the relationship between interval variables; regression analysis is the first tool we will

More information

2 Prediction and Analysis of Variance

2 Prediction and Analysis of Variance 2 Prediction and Analysis of Variance Reading: Chapters and 2 of Kennedy A Guide to Econometrics Achen, Christopher H. Interpreting and Using Regression (London: Sage, 982). Chapter 4 of Andy Field, Discovering

More information

Lecture 5: Omitted Variables, Dummy Variables and Multicollinearity

Lecture 5: Omitted Variables, Dummy Variables and Multicollinearity Lecture 5: Omitted Variables, Dummy Variables and Multicollinearity R.G. Pierse 1 Omitted Variables Suppose that the true model is Y i β 1 + β X i + β 3 X 3i + u i, i 1,, n (1.1) where β 3 0 but that the

More information

LI EAR REGRESSIO A D CORRELATIO

LI EAR REGRESSIO A D CORRELATIO CHAPTER 6 LI EAR REGRESSIO A D CORRELATIO Page Contents 6.1 Introduction 10 6. Curve Fitting 10 6.3 Fitting a Simple Linear Regression Line 103 6.4 Linear Correlation Analysis 107 6.5 Spearman s Rank Correlation

More information

Hint: The following equation converts Celsius to Fahrenheit: F = C where C = degrees Celsius F = degrees Fahrenheit

Hint: The following equation converts Celsius to Fahrenheit: F = C where C = degrees Celsius F = degrees Fahrenheit Amherst College Department of Economics Economics 360 Fall 2014 Exam 1: Solutions 1. (10 points) The following table in reports the summary statistics for high and low temperatures in Key West, FL from

More information

2. Linear regression with multiple regressors

2. Linear regression with multiple regressors 2. Linear regression with multiple regressors Aim of this section: Introduction of the multiple regression model OLS estimation in multiple regression Measures-of-fit in multiple regression Assumptions

More information

Simple Linear Regression

Simple Linear Regression CHAPTER 13 Simple Linear Regression CHAPTER OUTLINE 13.1 Simple Linear Regression Analysis 13.2 Using Excel s built-in Regression tool 13.3 Linear Correlation 13.4 Hypothesis Tests about the Linear Correlation

More information

Regression Analysis. A statistical procedure used to find relations among a set of variables.

Regression Analysis. A statistical procedure used to find relations among a set of variables. Regression Analysis A statistical procedure used to find relations among a set of variables. Understanding relations Mapping data enables us to examine (describe) where things occur (e.g., areas where

More information

Econometrics -- Final Exam (Sample)

Econometrics -- Final Exam (Sample) Econometrics -- Final Exam (Sample) 1) The sample regression line estimated by OLS A) has an intercept that is equal to zero. B) is the same as the population regression line. C) cannot have negative and

More information

1 Motivation for Instrumental Variable (IV) Regression

1 Motivation for Instrumental Variable (IV) Regression ECON 370: IV & 2SLS 1 Instrumental Variables Estimation and Two Stage Least Squares Econometric Methods, ECON 370 Let s get back to the thiking in terms of cross sectional (or pooled cross sectional) data

More information

y response variable x 1, x 2,, x k -- a set of explanatory variables

y response variable x 1, x 2,, x k -- a set of explanatory variables 11. Multiple Regression and Correlation y response variable x 1, x 2,, x k -- a set of explanatory variables In this chapter, all variables are assumed to be quantitative. Chapters 12-14 show how to incorporate

More information

Handout 12. Endogeneity & Simultaneous Equation Models

Handout 12. Endogeneity & Simultaneous Equation Models Handout 12. Endogeneity & Simultaneous Equation Models In which you learn about another potential source of endogeneity caused by the simultaneous determination of economic variables, and learn how to

More information

Outline. Lesson 3: Linear Functions. Objectives:

Outline. Lesson 3: Linear Functions. Objectives: Lesson 3: Linear Functions Objectives: Outline I can determine the dependent and independent variables in a linear function. I can read and interpret characteristics of linear functions including x- and

More information

In order to carry out a study on employees wages, a company collects information from its 500 employees 1 as follows:

In order to carry out a study on employees wages, a company collects information from its 500 employees 1 as follows: INTRODUCTORY ECONOMETRICS Dpt of Econometrics & Statistics (EA3) University of the Basque Country UPV/EHU OCW Self Evaluation answers Time: 21/2 hours SURNAME: NAME: ID#: Specific competences to be evaluated

More information

ECON 497: Lecture Notes 10 Page 1 of 1

ECON 497: Lecture Notes 10 Page 1 of 1 ECON 497: Lecture Notes 10 Page 1 of 1 Metropolitan State University ECON 497: Research and Forecasting Lecture Notes 10 Heteroskedasticity Studenmund Chapter 10 We'll start with a quote from Studenmund:

More information

Chapter 14 Multiple Regression Analysis

Chapter 14 Multiple Regression Analysis Chapter 14 Multiple Regression Analysis 1. a. Multiple regression equation b. the Y-intercept c. $374,748 found by Y ˆ = 64,1 +.394(796,) + 9.6(694) 11,6(6.) (LO 1) 2. a. Multiple regression equation b.

More information

AP Final Review II Exploring Data (20% 30%)

AP Final Review II Exploring Data (20% 30%) AP Final Review II Exploring Data (20% 30%) Quantitative vs Categorical Variables Quantitative variables are numerical values for which arithmetic operations such as means make sense. It is usually a measure

More information

LECTURE 10. Introduction to Econometrics. Multicollinearity & Heteroskedasticity

LECTURE 10. Introduction to Econometrics. Multicollinearity & Heteroskedasticity LECTURE 10 Introduction to Econometrics Multicollinearity & Heteroskedasticity November 22, 2016 1 / 23 ON PREVIOUS LECTURES We discussed the specification of a regression equation Specification consists

More information

Keller: Stats for Mgmt & Econ, 7th Ed July 17, 2006

Keller: Stats for Mgmt & Econ, 7th Ed July 17, 2006 Chapter 17 Simple Linear Regression and Correlation 17.1 Regression Analysis Our problem objective is to analyze the relationship between interval variables; regression analysis is the first tool we will

More information

Regression Analysis. BUS 735: Business Decision Making and Research. Learn how to detect relationships between ordinal and categorical variables.

Regression Analysis. BUS 735: Business Decision Making and Research. Learn how to detect relationships between ordinal and categorical variables. Regression Analysis BUS 735: Business Decision Making and Research 1 Goals of this section Specific goals Learn how to detect relationships between ordinal and categorical variables. Learn how to estimate

More information

Regression Models. Chapter 4. Introduction. Introduction. Introduction

Regression Models. Chapter 4. Introduction. Introduction. Introduction Chapter 4 Regression Models Quantitative Analysis for Management, Tenth Edition, by Render, Stair, and Hanna 008 Prentice-Hall, Inc. Introduction Regression analysis is a very valuable tool for a manager

More information

Lecture 4 Scatterplots, Association, and Correlation

Lecture 4 Scatterplots, Association, and Correlation Lecture 4 Scatterplots, Association, and Correlation Previously, we looked at Single variables on their own One or more categorical variable In this lecture: We shall look at two quantitative variables.

More information

ECO220Y Simple Regression: Testing the Slope

ECO220Y Simple Regression: Testing the Slope ECO220Y Simple Regression: Testing the Slope Readings: Chapter 18 (Sections 18.3-18.5) Winter 2012 Lecture 19 (Winter 2012) Simple Regression Lecture 19 1 / 32 Simple Regression Model y i = β 0 + β 1 x

More information

Chapter 10 Nonlinear Models

Chapter 10 Nonlinear Models Chapter 10 Nonlinear Models Nonlinear models can be classified into two categories. In the first category are models that are nonlinear in the variables, but still linear in terms of the unknown parameters.

More information

Multiple Regression Theory 2006 Samuel L. Baker

Multiple Regression Theory 2006 Samuel L. Baker MULTIPLE REGRESSION THEORY 1 Multiple Regression Theory 2006 Samuel L. Baker Multiple regression is regression with two or more independent variables on the right-hand side of the equation. Use multiple

More information

Unless provided with information to the contrary, assume for each question below that the Classical Linear Model assumptions hold.

Unless provided with information to the contrary, assume for each question below that the Classical Linear Model assumptions hold. Economics 345: Applied Econometrics Section A01 University of Victoria Midterm Examination #2 Version 1 SOLUTIONS Spring 2015 Instructor: Martin Farnham Unless provided with information to the contrary,

More information

Do not copy, post, or distribute

Do not copy, post, or distribute 14 CORRELATION ANALYSIS AND LINEAR REGRESSION Assessing the Covariability of Two Quantitative Properties 14.0 LEARNING OBJECTIVES In this chapter, we discuss two related techniques for assessing a possible

More information

download instant at

download instant at Answers to Odd-Numbered Exercises Chapter One: An Overview of Regression Analysis 1-3. (a) Positive, (b) negative, (c) positive, (d) negative, (e) ambiguous, (f) negative. 1-5. (a) The coefficients in

More information

REED TUTORIALS (Pty) LTD ECS3706 EXAM PACK

REED TUTORIALS (Pty) LTD ECS3706 EXAM PACK REED TUTORIALS (Pty) LTD ECS3706 EXAM PACK 1 ECONOMETRICS STUDY PACK MAY/JUNE 2016 Question 1 (a) (i) Describing economic reality (ii) Testing hypothesis about economic theory (iii) Forecasting future

More information

405 ECONOMETRICS Chapter # 11: MULTICOLLINEARITY: WHAT HAPPENS IF THE REGRESSORS ARE CORRELATED? Domodar N. Gujarati

405 ECONOMETRICS Chapter # 11: MULTICOLLINEARITY: WHAT HAPPENS IF THE REGRESSORS ARE CORRELATED? Domodar N. Gujarati 405 ECONOMETRICS Chapter # 11: MULTICOLLINEARITY: WHAT HAPPENS IF THE REGRESSORS ARE CORRELATED? Domodar N. Gujarati Prof. M. El-Sakka Dept of Economics Kuwait University In this chapter we take a critical

More information

MORE ON SIMPLE REGRESSION: OVERVIEW

MORE ON SIMPLE REGRESSION: OVERVIEW FI=NOT0106 NOTICE. Unless otherwise indicated, all materials on this page and linked pages at the blue.temple.edu address and at the astro.temple.edu address are the sole property of Ralph B. Taylor and

More information

13.7 ANOTHER TEST FOR TREND: KENDALL S TAU

13.7 ANOTHER TEST FOR TREND: KENDALL S TAU 13.7 ANOTHER TEST FOR TREND: KENDALL S TAU In 1969 the U.S. government instituted a draft lottery for choosing young men to be drafted into the military. Numbers from 1 to 366 were randomly assigned to

More information

CHAPTER 5 FUNCTIONAL FORMS OF REGRESSION MODELS

CHAPTER 5 FUNCTIONAL FORMS OF REGRESSION MODELS CHAPTER 5 FUNCTIONAL FORMS OF REGRESSION MODELS QUESTIONS 5.1. (a) In a log-log model the dependent and all explanatory variables are in the logarithmic form. (b) In the log-lin model the dependent variable

More information

1 Correlation between an independent variable and the error

1 Correlation between an independent variable and the error Chapter 7 outline, Econometrics Instrumental variables and model estimation 1 Correlation between an independent variable and the error Recall that one of the assumptions that we make when proving the

More information

11.5 Regression Linear Relationships

11.5 Regression Linear Relationships Contents 11.5 Regression............................. 835 11.5.1 Linear Relationships................... 835 11.5.2 The Least Squares Regression Line........... 837 11.5.3 Using the Regression Line................

More information

Answer all questions from part I. Answer two question from part II.a, and one question from part II.b.

Answer all questions from part I. Answer two question from part II.a, and one question from part II.b. B203: Quantitative Methods Answer all questions from part I. Answer two question from part II.a, and one question from part II.b. Part I: Compulsory Questions. Answer all questions. Each question carries

More information

Objectives for Linear Activity. Calculate average rate of change/slope Interpret intercepts and slope of linear function Linear regression

Objectives for Linear Activity. Calculate average rate of change/slope Interpret intercepts and slope of linear function Linear regression Objectives for Linear Activity Calculate average rate of change/slope Interpret intercepts and slope of linear function Linear regression 1 Average Rate of Change & Slope On a graph, average rate of change

More information

Lecture #8 & #9 Multiple regression

Lecture #8 & #9 Multiple regression Lecture #8 & #9 Multiple regression Starting point: Y = f(x 1, X 2,, X k, u) Outcome variable of interest (movie ticket price) a function of several variables. Observables and unobservables. One or more

More information

Sociology 593 Exam 1 Answer Key February 17, 1995

Sociology 593 Exam 1 Answer Key February 17, 1995 Sociology 593 Exam 1 Answer Key February 17, 1995 I. True-False. (5 points) Indicate whether the following statements are true or false. If false, briefly explain why. 1. A researcher regressed Y on. When

More information

Lecture 4 Scatterplots, Association, and Correlation

Lecture 4 Scatterplots, Association, and Correlation Lecture 4 Scatterplots, Association, and Correlation Previously, we looked at Single variables on their own One or more categorical variables In this lecture: We shall look at two quantitative variables.

More information

Chapter 1. An Overview of Regression Analysis. Econometrics and Quantitative Analysis. What is Econometrics? (cont.) What is Econometrics?

Chapter 1. An Overview of Regression Analysis. Econometrics and Quantitative Analysis. What is Econometrics? (cont.) What is Econometrics? Econometrics and Quantitative Analysis Using Econometrics: A Practical Guide A.H. Studenmund 6th Edition. Addison Wesley Longman Chapter 1 An Overview of Regression Analysis Instructor: Dr. Samir Safi

More information

ECON Introductory Econometrics. Lecture 16: Instrumental variables

ECON Introductory Econometrics. Lecture 16: Instrumental variables ECON4150 - Introductory Econometrics Lecture 16: Instrumental variables Monique de Haan (moniqued@econ.uio.no) Stock and Watson Chapter 12 Lecture outline 2 OLS assumptions and when they are violated Instrumental

More information

appstats27.notebook April 06, 2017

appstats27.notebook April 06, 2017 Chapter 27 Objective Students will conduct inference on regression and analyze data to write a conclusion. Inferences for Regression An Example: Body Fat and Waist Size pg 634 Our chapter example revolves

More information

LECTURE 15: SIMPLE LINEAR REGRESSION I

LECTURE 15: SIMPLE LINEAR REGRESSION I David Youngberg BSAD 20 Montgomery College LECTURE 5: SIMPLE LINEAR REGRESSION I I. From Correlation to Regression a. Recall last class when we discussed two basic types of correlation (positive and negative).

More information

Chapter 7. Testing Linear Restrictions on Regression Coefficients

Chapter 7. Testing Linear Restrictions on Regression Coefficients Chapter 7 Testing Linear Restrictions on Regression Coefficients 1.F-tests versus t-tests In the previous chapter we discussed several applications of the t-distribution to testing hypotheses in the linear

More information

statistical sense, from the distributions of the xs. The model may now be generalized to the case of k regressors:

statistical sense, from the distributions of the xs. The model may now be generalized to the case of k regressors: Wooldridge, Introductory Econometrics, d ed. Chapter 3: Multiple regression analysis: Estimation In multiple regression analysis, we extend the simple (two-variable) regression model to consider the possibility

More information

Trendlines Simple Linear Regression Multiple Linear Regression Systematic Model Building Practical Issues

Trendlines Simple Linear Regression Multiple Linear Regression Systematic Model Building Practical Issues Trendlines Simple Linear Regression Multiple Linear Regression Systematic Model Building Practical Issues Overfitting Categorical Variables Interaction Terms Non-linear Terms Linear Logarithmic y = a +

More information

Swarthmore Honors Exam 2012: Statistics

Swarthmore Honors Exam 2012: Statistics Swarthmore Honors Exam 2012: Statistics 1 Swarthmore Honors Exam 2012: Statistics John W. Emerson, Yale University NAME: Instructions: This is a closed-book three-hour exam having six questions. You may

More information

FAQ: Linear and Multiple Regression Analysis: Coefficients

FAQ: Linear and Multiple Regression Analysis: Coefficients Question 1: How do I calculate a least squares regression line? Answer 1: Regression analysis is a statistical tool that utilizes the relation between two or more quantitative variables so that one variable

More information