Empirical Application of Panel Data Regression

1. We use the Fatality data, and we are interested in whether raising the beer tax rate can help lower traffic deaths. So the dependent variable is traffic death, while the key regressor is the beer tax rate.

2. The data is a csv file. The command to read a csv file is insheet (or use the menu).

    . insheet using "I:\420\420_data_fatality.csv"
    (9 vars, 336 obs)
    . rename statename state
    . label variable y "number of annual traffic death per 10,000 people"
    . encode state, gen(id)
    . list state id year beertax y in 1/10, nol

         +----------------------------------------+
         | state   id   year    beertax         y |
         |----------------------------------------|
      1. |    AL    1   1982   1.539379   2.12836 |
      2. |    AL    1   1983   1.788991   2.34848 |
      3. |    AL    1   1984   1.714286   2.33643 |
      4. |    AL    1   1985   1.652542   2.19348 |
      5. |    AL    1   1986   1.609907   2.66914 |
         |----------------------------------------|
      6. |    AL    1   1987       1.56   2.71859 |
      7. |    AL    1   1988   1.501444   2.49391 |
      8. |    AR    2   1982    .650358   2.38405 |
      9. |    AR    2   1983   .6754587    2.3957 |
     10. |    AR    2   1984   .5989011   2.23785 |
         +----------------------------------------+

    . xtset id
           panel variable:  id (balanced)

We observe each state repeatedly, so this is panel data. Each state is a panel (group, cluster). Here each panel has 7 observations; each observation is a state-year.

3. The variable state is a string, for which we can generate a categorical variable using encode state, gen(id). Then we use id to declare that this is panel data, and Stata finds it is balanced, i.e., no state is missing an observation in any year. This can also be seen by using the command

    . tab year

           year |      Freq.     Percent        Cum.
    ------------+-----------------------------------
           1982 |         48       14.29       14.29
           1983 |         48       14.29       28.57
            ...
           1988 |         48       14.29      100.00
    ------------+-----------------------------------
          Total |        336      100.00

4. We can construct panel data by stacking the cross section in one year above another (using the append command). To see this,

    . sort year id
    . list state year y in 1/50

         +------------------------+
         | state   year         y |
         |------------------------|
      1. |    AL   1982   2.12836 |
      2. |    AR   1982   2.38405 |
         ...
     49. |    AL   1983   2.34848 |
     50. |    AR   1983    2.3957 |
         +------------------------+

The first 48 observations are the 1982 cross section, followed by the 1983 cross section, and so on. It is important to have an id variable (Stata calls it the group variable) such as state in each cross section.

5. In panel data there are three types of variables: (1) variables that are time-invariant, such as state; (2) variables that are panel-invariant, such as year; and (3) variables that vary over time and across panels, such as beer tax and y. Later we will show that the fixed effect (FE) and first difference (FD) estimators cannot estimate the effect of type (1) variables, those that are time-invariant. We can use xtsum to tell which type a variable is:

    . xtsum y year id

    Variable         |      Mean   Std. Dev.       Min        Max |    Observations
    -----------------+--------------------------------------------+----------------
    y        overall |  2.040443   .5701951       .821    4.21784 |     N =     336
             between |             .5461418   1.110047   3.653197 |     n =      48
             within  |             .1794263   1.455559   2.962663 |     T =       7
    year     overall |      1985   2.002983       1982       1988 |     N =     336
             between |                    0       1985       1985 |     n =      48
             within  |             2.002983       1982       1988 |     T =       7
    id       overall |      24.5   13.87406          1         48 |     N =     336
             between |                   14          1         48 |     n =      48
             within  |                    0       24.5       24.5 |     T =       7

A variable is panel-invariant if its between standard deviation is zero; it is time-invariant if its within standard deviation is zero.

6. First, we consider using only one year of data (a cross section). The scatter plot and simple regression both indicate a positive correlation between beer tax and traffic death, which is counterintuitive.

    . twoway (sca y beertax if year==1982, ml(state)) (lfit y beertax if year==1982)

[Figure: scatter plot of y (number of annual traffic death per 10,000 people, vertical axis) against beertax (horizontal axis) for 1982, with state abbreviations as marker labels and the fitted line.]

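    . reg y beertax if year==1982, nohe

    ------------------------------------------------------------------------------
               y |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
    -------------+----------------------------------------------------------------
         beertax |   .1484603   .1883682     0.79   0.435    -.2307051    .5276258
           _cons |   2.010381   .1390785    14.46   0.000     1.730431    2.290332
    ------------------------------------------------------------------------------

7. Running a pooled regression that uses all years does not help: the slope is still positive, and it is now significant.

    . reg y beertax

          Source |       SS       df       MS              Number of obs =     336
    -------------+------------------------------           F(  1,   334) =   34.39
           Model |  10.1687134     1  10.1687134           Prob > F      =  0.0000
        Residual |  98.7473114   334  .295650633           R-squared     =  0.0934
    -------------+------------------------------           Adj R-squared =  0.0906
           Total |  108.916025   335  .325122462           Root MSE      =  .54374

    ------------------------------------------------------------------------------
               y |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
    -------------+----------------------------------------------------------------
         beertax |   .3646064     .06217     5.86   0.000     .2423124    .4869005
           _cons |   1.853307   .0435672    42.54   0.000     1.767606    1.939007
    ------------------------------------------------------------------------------

Notice that the sample size in the pooled regression is 48 × 7 = 336, resulting in a bigger t value.

8. This across-state comparison, whether using one year or all years, is prone to omitted variable bias. Factors like drinking culture are unobserved, but they can affect both beer tax and traffic death. If we believe drinking culture is (largely) time-invariant, then an over-time comparison (time series regression) for a given state is more meaningful:

    . two (sca y beertax if state=="AL", ml(year)) (lfit y beertax if state=="AL")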

[Figure: scatter plot of y (number of annual traffic death per 10,000 people, vertical axis) against beertax (horizontal axis) for AL, 1982-1988, with years as marker labels and the fitted line.]

    . reg y beertax if state=="AL", nohe

    ------------------------------------------------------------------------------
               y |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
    -------------+----------------------------------------------------------------
         beertax |  -.5237809   .9580534    -0.55   0.608    -2.986535    1.938974
           _cons |   3.263139   1.558319     2.09   0.090    -.7426469    7.268924
    ------------------------------------------------------------------------------

Now we see a negative $\hat{\beta}_1$, though it is insignificant.

9. Intuitively, the over-time or within comparison is more convincing because we can safely rule out the effect of time-constant unobserved factors such as drinking culture: something that is fixed over time cannot explain the over-time variation in y.

10. If we assume the marginal effect of beertax on traffic death does not change across states, then we can improve our estimate by using all states. Toward that end, we need to generate the panel-specific first (or one-year) difference of traffic death and beer tax

    $$\text{dy1} = y_{i,t} - y_{i,t-1} \qquad (1)$$
    $$\text{dx1} = \text{beertax}_{i,t} - \text{beertax}_{i,t-1} \qquad (2)$$

where $y_{i,t-1}$ denotes the first lag of $y_i$. Notice that we need two subscripts. The first subscript i indexes the panel, and the second subscript t indexes time. Here we compute the first difference for each given i. In Stata, that means using the by id: prefix.

    . sort id year
    . by id: gen ylag1 = y[_n-1]
    (48 missing values generated)
    . by id: gen dy1 = y-ylag1
    (48 missing values generated)
    . list id year y ylag1 dy1 in 1/10

         +-------------------------------------------+
         | id   year         y     ylag1         dy1 |
         |-------------------------------------------|
      1. | AL   1982   2.12836         .           . |
      2. | AL   1983   2.34848   2.12836      .22012 |
      3. | AL   1984   2.33643   2.34848   -.0120499 |
         ...
      7. | AL   1988   2.49391   2.71859   -.2246799 |
      8. | AR   1982   2.38405         .           . |
      9. | AR   1983    2.3957   2.38405    .0116501 |
     10. | AR   1984   2.23785    2.3957     -.15785 |
         +-------------------------------------------+

Notice that we get the first lag of y (called ylag1) by pushing the y series one period downward, so one missing value is generated for each state (48 in total). It is worth emphasizing that we do this separately for each state using by id. The first non-missing observation in the first difference series (called dy1) is $y_{1,2} - y_{1,1} = 2.34848 - 2.12836 = .22012$. Next we get the panel-specific first difference of beertax, and apply OLS to the differenced data:

    . by id: gen dx1 = beertax[_n]-beertax[_n-1]
    (48 missing values generated)
    . reg dy1 dx1, nohe

    ------------------------------------------------------------------------------
             dy1 |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
    -------------+----------------------------------------------------------------
             dx1 |  -.6974794   .2018674    -3.46   0.001    -1.094814   -.3001452
           _cons |  -.0048663    .016771    -0.29   0.772    -.0378765    .0281439
    ------------------------------------------------------------------------------

This first or one-year difference estimate is negative and significant, as we would expect.

11. To understand why the FD estimator works, consider the structural models:

    $$y_{i,t} = \beta_1 x_{i,t} + \beta_2 w_i + u_{i,t} \qquad (3)$$
    $$y_{i,t-1} = \beta_1 x_{i,t-1} + \beta_2 w_i + u_{i,t-1} \qquad (4)$$

where $w_i$ does not have the time subscript t because it denotes the time-invariant variable (like drinking culture). Subtracting (4) from (3) leads to

    $$\Delta y_{i,t} = \beta_1 \Delta x_{i,t} + \Delta u_{i,t} \qquad (5)$$

The FD estimator is just OLS applied to (5). Notice that $w_i$ has been removed by taking the difference. So OLS applied to the differenced data, model (5), is not subject to omitted variable bias. By contrast, OLS applied to the pooled regression (3) without $w_i$ suffers the omitted variable bias $\beta_2 \, \mathrm{cov}(x_{i,t}, w_i) / \mathrm{var}(x_{i,t})$.

12. It also becomes clear that the FD estimator cannot estimate the effect of any factor that is time-invariant (such as a dummy variable called south that equals one if a state is in the southern part of the country).

13. (Optional) The FD estimator is consistent as long as $\mathrm{cov}(\Delta x_{i,t}, \Delta u_{i,t}) = 0$. A sufficient condition is strict exogeneity: $\mathrm{cov}(x_{i,t_j}, u_{i,t_k}) = 0$ for all $t_j, t_k$.

14. (Optional) In general, the error term in (5) is serially correlated for given i: $\mathrm{cov}(\Delta u_{i,t}, \Delta u_{i,t-k}) \neq 0$, $k = 1, \ldots, T-1$. So the cluster-robust standard error should be used in theory.

15. Even though $\hat{\beta}_1$ is obtained from (5), its interpretation is still the change in y when x changes by one unit, i.e., in terms of (3).

16. To avoid the ambiguity of using the one-year difference or the two-year difference, we may consider the one-way fixed effect (FE) estimator:

    . xtreg y beertax, fe

    Fixed-effects (within) regression               Number of obs      =       336
    Group variable: id                              Number of groups   =        48

    R-sq:  within  = 0.0407                         Obs per group: min =         7
           between = 0.1101                                        avg =       7.0
           overall = 0.0934                                        max =         7

                                                    F(1,287)           =     12.19
    corr(u_i, Xb)  = -0.6885                        Prob > F           =    0.0006

    ------------------------------------------------------------------------------
               y |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
    -------------+----------------------------------------------------------------
         beertax |  -.6558746   .1878511    -3.49   0.001    -1.025615    -.286134
           _cons |   2.377075   .0969705    24.51   0.000     2.186211    2.567938
    -------------+----------------------------------------------------------------
         sigma_u |  .71471595
         sigma_e |  .18986054
             rho |  .93408435   (fraction of variance due to u_i)
    ------------------------------------------------------------------------------
    F test that all u_i=0:     F(47, 287) =    52.18             Prob > F = 0.0000

The FE estimator is based on the so-called within regression

    $$y_{i,t} - \bar{y}_i = \beta_1 (x_{i,t} - \bar{x}_i) + (u_{i,t} - \bar{u}_i) \qquad (6)$$

where $\bar{y}_i = \frac{1}{T}\sum_{t=1}^{T} y_{i,t}$ and $\bar{x}_i = \frac{1}{T}\sum_{t=1}^{T} x_{i,t}$ are the panel-specific means. The process of subtracting the panel-specific mean is called the within transformation. The point is that the within transformation, akin to taking the difference, removes the time-invariant unobserved heterogeneity: $w_i - \bar{w}_i = 0$ because $\bar{w}_i = w_i$. In other words, the FE estimator is not subject to the bias of omitting time-invariant factors. The downside is that the FE estimator cannot estimate the effect of any time-invariant factor.

17. We can obtain the demeaned value of a variable as the residual from regressing it on a constant term. Likewise, we need a panel-specific constant term, or panel-specific dummy variables, in order to generate the panel-specific demeaned values. That is why the FE estimate can be produced in the dummy variable regression (DVR), one that includes all except one state dummy variable:

    . qui tab state, gen(sd)
    . reg y beertax sd1-sd47

    ------------------------------------------------------------------------------
               y |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
    -------------+----------------------------------------------------------------
         beertax |  -.6558746   .1878511    -3.49   0.001    -1.025615    -.286134
             sd1 |   .2285051   .3128977     0.73   0.466    -.3873603    .8443705
             ...
            sd47 |    -.66825   .1239813    -5.39   0.000    -.9122779    -.424222
           _cons |   3.249126   .0723288    44.92   0.000     3.106764    3.391488
    ------------------------------------------------------------------------------

We drop one state dummy to avoid the dummy variable trap. Notice that $\hat{\beta}_1^{DVR} = -.6558746 = \hat{\beta}_1^{FE}$, a fact that the Frisch-Waugh (FW) theorem can prove.

18. Intuitively, the DVR resolves the omitted-variable issue by using panel-specific dummy variables as a proxy for $w_i$. This is a good idea since both $w_i$ and the dummy variables are time-constant.

19. Now we can test the null hypothesis that the state fixed effects (state dummies) are jointly insignificant ($H_0: \beta_2 = 0$ or equivalently $H_0: w_i = 0$).

    . testparm sd1-sd47

     ( 1)  sd1 = 0
     ...
     (47)  sd47 = 0

           F( 47,   287) =   52.18
                Prob > F =    0.0000

So the null hypothesis is rejected. This F test is reported by the xtreg command.

20. We can also test the significance of beertax:

    . test beertax

     ( 1)  beertax = 0

           F(  1,   287) =   12.19
                Prob > F =    0.0006

This F test is also reported by the xtreg command.

21. Now we can understand why the pooled OLS is biased. Basically, it is because the pooled OLS omits the state dummies, which are correlated with beertax:

    . qui reg beertax sd1-sd47
    . dis "F test for exogeneity of beertax is " e(F)
    F test for exogeneity of beertax is 452.72165

22. The first and third F tests, 52.18 and 452.72165, jointly explain why the pooled OLS is biased: the state dummies matter for y, and they are correlated with beertax.

23. In a similar fashion, we can obtain the two-way fixed effect estimator by including time dummy variables in the DVR. They serve as a proxy for panel-invariant unobserved effects (like the safety features of cars).

    . qui tab year, gen(yd)
    . reg y beertax sd1-sd47 yd1-yd6, nohe

    ------------------------------------------------------------------------------
               y |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
    -------------+----------------------------------------------------------------
         beertax |  -.6399771   .1973779    -3.24   0.001    -1.028504   -.2514501
             sd1 |   .2034567    .326807     0.62   0.534     -.439844    .8467574
             ...
             yd1 |   .0518036   .0396237     1.31   0.192    -.0261933    .1298006
             ...
             yd6 |   .0009017   .0384706     0.02   0.981    -.0748254    .0766288
           _cons |    3.25611   .0753773    43.20   0.000     3.107735    3.404486
    ------------------------------------------------------------------------------

24. Or, we can use the xtreg command along with time dummy variables:

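    . xtreg y beertax yd1-yd6, fe

    Fixed-effects (within) regression               Number of obs      =       336
    Group variable: id                              Number of groups   =        48

    ------------------------------------------------------------------------------
               y |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
    -------------+----------------------------------------------------------------
         beertax |  -.6399771   .1973779    -3.24   0.001    -1.028504   -.2514501
             ...
             yd6 |   .0009017   .0384706     0.02   0.981    -.0748254    .0766288
           _cons |   2.376665   .0985112    24.13   0.000     2.182751    2.570579
    ------------------------------------------------------------------------------

25. Finally, it is important to use the cluster-robust standard error in order to account for the serial correlation among the error terms for a given panel, i.e., between $u_{i,t} - \bar{u}_i$ and $u_{i,t-k} - \bar{u}_i$:

    . xtreg y beertax yd1-yd6, fe vce(cluster id)

    Fixed-effects (within) regression               Number of obs      =       336
    Group variable: id                              Number of groups   =        48

                                         (Std. Err. adjusted for 48 clusters in id)
    ------------------------------------------------------------------------------
                 |               Robust
               y |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
    -------------+----------------------------------------------------------------
         beertax |  -.6399771   .3570782    -1.79   0.080    -1.358326    .0783716
             yd1 |   .0518036   .0644023     0.80   0.425     -.077757    .1813643
             ...
             yd6 |   .0009017   .0287037     0.03   0.975    -.0568428    .0586461
           _cons |   2.376665   .1673381    14.20   0.000     2.040024    2.713306
    ------------------------------------------------------------------------------

    . sca hatbeta1 = _b[beertax]

The cluster-robust standard error can account for both across-panel heteroskedasticity and within-panel correlation. Here the point estimate is unchanged, but the standard error of beertax rises from .1973779 to .3570782, so the estimate is significant only at the 10% level.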
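As a rough illustration of what clustering changes, the sketch below computes the within estimate of a single slope together with a conventional and a cluster-robust standard error. It is written in Python (standard library only) rather than Stata so that it runs without the data file. The simulated panel, the function names simulate_panel and within_fit, and the small-sample factor n/(n-1) are all assumptions made for illustration; Stata's own finite-sample adjustment may differ, so the numbers are not meant to reproduce the output above.

```python
import math
import random

def simulate_panel(n=48, T=7, beta=-0.65, seed=1):
    """Hypothetical panel: fixed effect w, persistent regressor and error."""
    rng = random.Random(seed)
    rows = []                                  # (panel id, x, y)
    for i in range(n):
        w = rng.gauss(0, 1)                    # time-invariant unobserved factor
        z = rng.gauss(0, 1)                    # persistent part of x
        u = rng.gauss(0, 1)                    # serially correlated error
        for _ in range(T):
            z = 0.8 * z + rng.gauss(0, 1)
            u = 0.8 * u + rng.gauss(0, 0.5)
            x = 0.5 * w + z
            rows.append((i, x, beta * x + w + u))
    return rows

def within_fit(rows):
    """Within slope with conventional and cluster-robust standard errors."""
    groups = {}
    for i, x, y in rows:
        groups.setdefault(i, []).append((x, y))
    ids, dmx, dmy = [], [], []
    for i, obs in groups.items():
        xbar = sum(x for x, _ in obs) / len(obs)
        ybar = sum(y for _, y in obs) / len(obs)
        for x, y in obs:                       # within transformation
            ids.append(i)
            dmx.append(x - xbar)
            dmy.append(y - ybar)
    sxx = sum(a * a for a in dmx)
    b = sum(a * c for a, c in zip(dmx, dmy)) / sxx
    resid = [c - b * a for a, c in zip(dmx, dmy)]
    N, n = len(rows), len(groups)
    # Conventional SE: residual variance with N - n - 1 degrees of freedom.
    se_conv = math.sqrt(sum(r * r for r in resid) / (N - n - 1) / sxx)
    # Cluster-robust SE: add up the scores x*e within each panel, then square.
    score = {}
    for i, a, r in zip(ids, dmx, resid):
        score[i] = score.get(i, 0.0) + a * r
    meat = sum(s * s for s in score.values())
    se_clu = math.sqrt(n / (n - 1) * meat) / sxx   # n/(n-1): assumed correction
    return b, se_conv, se_clu

b, se_conv, se_clu = within_fit(simulate_panel())
print(f"slope {b:.3f}  conventional SE {se_conv:.3f}  cluster SE {se_clu:.3f}")
```

The only difference between the two standard errors is whether the products of regressor and residual are summed within a panel before being squared, which is what lets the clustered version pick up within-panel correlation.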

26. We can test the year fixed effect:

    . testparm yd*

     ( 1)  yd1 = 0
     ( 2)  yd2 = 0
     ( 3)  yd3 = 0
     ( 4)  yd4 = 0
     ( 5)  yd5 = 0
     ( 6)  yd6 = 0

           F(  6,    47) =    4.22
                Prob > F =    0.0018

In this case, the year fixed effect (yearly dummies) is significant at the 1% level. So omitted variable bias would arise if we forgot to include the yearly dummies.

27. The two-way FE estimate $\hat{\beta}_1^{\text{two-way FE}} = -.6399771$ is economically significant since

    . qui sum y
    . dis "percentage change in y is " hatbeta1/r(mean)
    percentage change in y is -.31364613

traffic death drops by about 31% of its sample mean when the tax rate rises by one unit.

28. Remember that the command

    xtreg y x timedummy, fe vce(cluster id)

reports the two-way fixed effect estimator with the cluster-robust standard error. It is the most commonly used command for applied economists. Equivalently, you can get the same $\hat{\beta}_1$ by using

    reg y x paneldummy timedummy

The FE estimator cannot be used if the key regressor is time-invariant; it is biased if the unobserved heterogeneity is time-varying.

29. (Optional) The commands to do the within transformation and to obtain the within, between and random effect estimators are

    xtreg y beertax, fe
    * within transformation
    bysort id: egen ybar = mean(y)
    bysort id: gen dmy = y - ybar
    bysort id: egen xbar = mean(beertax)
    bysort id: gen dmx = beertax - xbar
    * within regression
    reg dmy dmx
    * panel-specific intercept term (u_i in stata and a_i in textbook)
    gen u = ybar - _b[dmx]*xbar
    * standard deviation of u_i
    sum u if year==1982
    * Between regression 14.2 in the textbook
    sort id year
    reg ybar xbar if year==1982
    * Random effect model
    xtreg y beertax, re
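The logic of the within transformation can also be checked on simulated data, where the true slope is known. The following is a minimal sketch in Python (standard library only), not a translation of the Stata commands above: the data-generating process, the true slope of -0.65, and the function names ols_slope, make_panel and estimates are assumptions chosen for illustration. The regressor is built to be positively correlated with a time-invariant factor w, so pooled OLS should be biased upward, while the within (FE) and first difference (FD) estimates should both stay close to the true slope.

```python
import random

def ols_slope(x, y):
    """Slope from a simple regression of y on x with a constant."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    return sxy / sxx

def make_panel(n=48, T=7, beta=-0.65, seed=0):
    """Hypothetical balanced panel: y = beta*x + w + u, with x correlated with w."""
    rng = random.Random(seed)
    panel = []
    for _ in range(n):
        w = rng.gauss(0, 1)                          # time-invariant factor
        xs = [w + rng.gauss(0, 1) for _ in range(T)]
        ys = [beta * x + w + rng.gauss(0, 0.5) for x in xs]
        panel.append((xs, ys))
    return panel

def estimates(panel):
    """Return the pooled OLS, within (FE) and first difference (FD) slopes."""
    # Pooled OLS: stack all panel-years and ignore w.
    px = [x for xs, _ in panel for x in xs]
    py = [y for _, ys in panel for y in ys]
    pooled = ols_slope(px, py)
    # Within transformation: subtract the panel-specific means.
    dmx, dmy = [], []
    for xs, ys in panel:
        xbar, ybar = sum(xs) / len(xs), sum(ys) / len(ys)
        dmx += [x - xbar for x in xs]
        dmy += [y - ybar for y in ys]
    fe = ols_slope(dmx, dmy)
    # First difference: subtract the previous period within each panel.
    dx, dy = [], []
    for xs, ys in panel:
        dx += [xs[t] - xs[t - 1] for t in range(1, len(xs))]
        dy += [ys[t] - ys[t - 1] for t in range(1, len(ys))]
    fd = ols_slope(dx, dy)
    return pooled, fe, fd

pooled, fe, fd = estimates(make_panel())
print(f"pooled {pooled:.3f}  FE {fe:.3f}  FD {fd:.3f}")
```

Under these assumptions the pooled bias is roughly cov(x, w)/var(x) = 0.5, which mirrors the pattern in the notes: the pooled slope is pushed upward, while both transformations remove w before the regression is run.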