Introduction to Panel Data Analysis


1 Introduction to Panel Data Analysis
Youngki Shin, Department of Economics
Statistics and Data Series at Western
November 21,

4 Motivation
More observations mean more information.
More observations with a certain structure mean much more information: pooled cross sections and panel data.
How can we extract additional information from pooled cross sections or panel data?

7 Example: Effect of an Incinerator on Housing Prices
With cross-sectional data from 1981, we have
rprice = 101, , nearinc n = 142
(3, 093.0) (5, )
With another cross section from 1978, when there was no incinerator, we have
rprice = 82, , nearinc n = 179
(2, ) (4, )
Therefore, the true effect of the incinerator is not 30, but 30, ( 18, ) = 11,

8 Outline
Data Structure
Policy Evaluation with Pooled Cross Sections
Three Approaches in Panel Data Estimation
  First Difference (FD) Estimator
  Fixed Effect (FE) Estimator
  Random Effect (RE) Estimator
Empirical Application: Smoking on Birth Outcomes
Concluding Remarks

10 Data Structure (cont.)
A set of pooled cross sections is obtained by sampling randomly from a large population at different time points.
A (typical) panel data set follows the same individuals over time.
For example, suppose I sample three individuals from this room at two time points:
  Time   Pooled                 Panel
  t=1    John, Jane, Evelyn     Eric, Andrew, Rachel
  t=2    Kyle, Justin, Lisa     Eric, Andrew, Rachel

11 Data Structure (cont.)
A Snapshot of Data
Table: Pooled Data (columns: year, rprice, nearinc, y)
Table: Panel Data (columns: id, year, inf, unem)

12 Data Structure (cont.)
There are also very useful panel structures other than the individual-time combination.
1. Twins data: i is the id of the twin pair, and t indexes the individual within that pair. This controls for unobserved genetic factors.
2. School data: students are sampled from many schools (or classrooms). Then i is the school id, and t indexes the student in school i.

13 Data Structure (cont.)
Examples of pooled cross sections:
  Current Population Survey (CPS), USA
Examples of panel data:
  Labor Market Activity Survey (LMAS), Canada
  Panel Study of Income Dynamics (PSID), USA
  National Longitudinal Survey of Youth (NLSY), USA
  A time series of provincial (or country) level data, e.g. the inflation and unemployment rates of 50 countries.
It is usually easier to collect pooled cross sections than panel data.

15 Policy Evaluation with Pooled Cross Sections: Difference-in-Difference Estimator
Terminology:
  Treatment group: those who are affected by a policy (a treatment).
  Control group: those who are not.
The objective of policy evaluation is to measure the (mean) difference in outcomes between the treatment group and the control group. This measure is also called the average treatment effect.
Suppose you are testing the effect of a new drug. How can you design the experiment? Randomization.
Recall the incinerator and housing prices example. Is randomization possible?

16 Policy Evaluation with Pooled Cross Sections: Difference-in-Difference Estimator
Consider an example of a drug test:
$$\text{blprs}_i = \beta_0 + \beta_1 \text{treat}_i + u_i$$
If you randomized the control and treatment groups well, i.e. $Cov(\text{treat}_i, u_i) = 0$, then you can estimate the effect of the drug from a single cross section.
In policy evaluation in the social sciences, $\text{treat}_i$ and $u_i$ are easily correlated:
$$\log(\text{wage}_i) = \beta_0 + \beta_1 \text{jbtrn}_i + u_i$$

17 Policy Evaluation with Pooled Cross Sections: Difference-in-Difference Estimator
Pooled cross sections help us evaluate the policy effect correctly by measuring the difference twice (before and after the policy implementation).
Recall the two regressions in the incinerator example, run separately for 1978 and 1981:
$$\text{rprice} = \gamma_0 + \gamma_1 \text{nearinc} + u$$
$$\hat{\delta}_1 = \hat{\gamma}_{1,81} - \hat{\gamma}_{1,78} = (\overline{\text{rprice}}_{81,nr} - \overline{\text{rprice}}_{81,fr}) - (\overline{\text{rprice}}_{78,nr} - \overline{\text{rprice}}_{78,fr})$$
If perfectly randomized, the second term is 0.
This estimator is called the Difference-in-Difference estimator.

18 Policy Evaluation with a Pooled Cross Section: Difference-in-Difference Estimator
The effect can be estimated by a single regression with dummy variables:
$$\text{rprice} = \beta_0 + \delta_0\, y81 + \beta_1\, \text{nearinc} + \delta_1\, y81 \cdot \text{nearinc} + u$$
This result is not intuitive. Just follow the logic:
                            Before (y81 = 0)       After (y81 = 1)                            After - Before
  Control (nearinc = 0)     $\beta_0$              $\beta_0 + \delta_0$                       $\delta_0$
  Treatment (nearinc = 1)   $\beta_0 + \beta_1$    $\beta_0 + \delta_0 + \beta_1 + \delta_1$  $\delta_0 + \delta_1$
  Treatment - Control       $\beta_1$              $\beta_1 + \delta_1$                       $\delta_1$
Therefore, $\delta_1$ in the above regression gives the same estimate as the Difference-in-Difference estimator.
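As a minimal sketch of the logic in the table above, the following Python snippet computes the difference-in-difference from the four cell means. The function name `did_estimate`, the toy data, and the parameter values are hypothetical and chosen only for illustration; they are not the estimates from the incinerator example.

```python
from statistics import mean


def did_estimate(rows):
    """Difference-in-difference from pooled cross sections.

    rows: iterable of (y, after, treated), with after and treated in {0, 1}.
    Returns (treated - control gap after) - (treated - control gap before).
    """
    cells = {(a, t): [] for a in (0, 1) for t in (0, 1)}
    for y, after, treated in rows:
        cells[(after, treated)].append(y)
    m = {key: mean(values) for key, values in cells.items()}
    gap_after = m[(1, 1)] - m[(1, 0)]
    gap_before = m[(0, 1)] - m[(0, 0)]
    return gap_after - gap_before


# Hypothetical parameters (assumed for illustration only).
beta0, delta0, beta1, delta1 = 80.0, 20.0, -19.0, -12.0

# Build a toy sample: three observations per cell, with noise that averages to zero.
toy = []
for after in (0, 1):
    for treated in (0, 1):
        level = beta0 + delta0 * after + beta1 * treated + delta1 * after * treated
        for noise in (-1.0, 0.0, 1.0):
            toy.append((level + noise, after, treated))

did = did_estimate(toy)
print(did)  # recovers delta1 because the noise cancels within each cell
```

Working with cell means keeps the sketch free of any regression library; with a saturated set of dummies, the interaction coefficient equals this double difference.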

20 Panel Data and the First Difference (FD) Estimator
In panel data, we follow the same individual over time. This specific structure enables us to conduct a better analysis.
Specifically, we can control for certain types of omitted variables called unobserved heterogeneity. Let us think about an example:
$$\log(\text{wage}_{it}) = \beta_0 + \delta_0\, d2_t + \beta_1 \text{educ}_{it} + \underbrace{a_i + u_{it}}_{v_{it}}$$
Notation: now we have two subscripts, i and t.
Both $a_i$ and $u_{it}$ are unobservables, called a fixed effect and an idiosyncratic error, respectively.

21 Panel Data and the First Difference (FD) Estimator
For simplicity, consider a two-period model:
$$y_{it} = \beta_0 + \delta_0\, d2_t + \beta_1 x_{it} + a_i + u_{it}, \quad t = 1, 2.$$
Pooled OLS does not work well since $a_i$ is usually correlated with $x_{it}$, i.e. $Cov(v_{it}, x_{it}) \neq 0$.
A simple solution is the First-Difference (FD) estimator:
$$y_{i2} = (\beta_0 + \delta_0) + \beta_1 x_{i2} + a_i + u_{i2} \quad (t = 2)$$
$$y_{i1} = \beta_0 + \beta_1 x_{i1} + a_i + u_{i1} \quad (t = 1)$$
Taking the difference gives
$$y_{i2} - y_{i1} = \delta_0 + \beta_1 (x_{i2} - x_{i1}) + (u_{i2} - u_{i1})$$
or
$$\Delta y_i = \delta_0 + \beta_1 \Delta x_i + \Delta u_i.$$
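The two-period derivation can be illustrated with a small, noise-free sketch in Python. All numbers below are hypothetical: the fixed effects `a` are deliberately chosen to rise with `x`, so pooled OLS on the levels overstates the slope, while OLS on the differences recovers `beta1` and `delta0`. The helper `ols_simple` is an illustrative name, not part of any library.

```python
def ols_simple(x, y):
    """Return (intercept, slope) of an OLS regression of y on a single regressor x."""
    n = len(x)
    xbar = sum(x) / n
    ybar = sum(y) / n
    sxx = sum((xi - xbar) ** 2 for xi in x)
    sxy = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return ybar - slope * xbar, slope


# Hypothetical two-period panel with four individuals (assumed values).
beta0, delta0, beta1 = 1.0, 0.5, 2.0
a = [0.0, 1.0, 2.0, 3.0]    # fixed effects, correlated with x by construction
x1 = [1.0, 2.0, 3.0, 4.0]   # regressor in period 1
x2 = [2.0, 2.5, 5.0, 4.5]   # regressor in period 2
y1 = [beta0 + beta1 * x + ai for x, ai in zip(x1, a)]
y2 = [beta0 + delta0 + beta1 * x + ai for x, ai in zip(x2, a)]

# Pooled OLS on levels ignores a_i (and the period dummy) and is biased here.
_, pooled_slope = ols_simple(x1 + x2, y1 + y2)

# First differences remove a_i; the intercept of this regression is delta0.
dx = [q - p for p, q in zip(x1, x2)]
dy = [q - p for p, q in zip(y1, y2)]
fd_intercept, fd_slope = ols_simple(dx, dy)

print(pooled_slope, fd_intercept, fd_slope)
```

The example leaves out the idiosyncratic error so that the contrast between the two estimators comes only from the fixed effect.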

22 Panel Data and the First Difference (FD) Estimator
The (pooled) OLS works in the new regression,
$$\Delta y_i = \delta_0 + \beta_1 \Delta x_i + \Delta u_i,$$
if
1. $\Delta u_i$ and $\Delta x_i$ are uncorrelated;
2. $\Delta x_i$ has some variation.
The second condition is violated if $x_{it}$ does not change over time, e.g. gender or race. Then $\Delta x_i = 0$.
Even in the wage equation example,
$$\log(\text{wage}_{it}) = \beta_0 + \delta_0\, d2_t + \beta_1 \text{educ}_{it} + a_i + u_{it},$$
most of the working population does not increase its years of education.

23 Panel Data and the First Difference (FD) Estimator: More than Two Time Periods
When panel data contain more than two time periods, we can still apply the FD estimator to control for unobserved heterogeneity.
A sufficient condition for the estimator to be valid is
$$Cov(x_{it}, u_{is}) = 0 \text{ for all } t \text{ and } s.$$
This condition is violated when
1. Future regressors react to the past dependent variable (feedback);
2. Regressors contain a lagged dependent variable;
3. An important (i.e. related to $x_{it}$) time-varying regressor is omitted.
Take differences between adjacent time periods and run the following regression when t = 1, 2, and 3:
$$\Delta y_{it} = \alpha_0 + \alpha_3\, d3_t + \beta_1 \Delta x_{it} + \Delta u_{it} \quad \text{for } t = 2, 3.$$

24 Additional Remarks on FD Estimator
Because the data now extend over the time dimension, serial correlation may arise. We also cannot rule out heteroskedasticity.
Since we use the OLS estimator, we can apply the White correction or the HAC estimation method as before.

25 Fixed Effect Estimator
Consider a simple error component model again:
$$y_{it} = \beta_1 x_{it} + a_i + u_{it}, \quad t = 1, \ldots, T \text{ and } i = 1, \ldots, n.$$
We assume that the idiosyncratic error $u_{it}$ is innocuous in the sense that
$$E(u_{it} \mid X_i) = 0 \quad \text{or} \quad E(u_{it} \mid x_{it}) = 0.$$
However, the individual fixed effect $a_i$ could be arbitrarily correlated with $x_{it}$.
We already know that the FD estimator cancels out the unobserved heterogeneity $a_i$.

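26 Fixed Effect Estimator
There is a different way to cancel out unobserved heterogeneity.
First, fix the individual i and take an average over time:
$$\bar{y}_i = \beta_1 \bar{x}_i + a_i + \bar{u}_i,$$
where
$$\bar{y}_i = \frac{1}{T} \sum_{t=1}^{T} y_{it}, \quad \bar{x}_i = \frac{1}{T} \sum_{t=1}^{T} x_{it}, \quad \text{and} \quad \bar{u}_i = \frac{1}{T} \sum_{t=1}^{T} u_{it}.$$
The point is
$$\bar{a}_i = \frac{1}{T} \sum_{t=1}^{T} a_i = \frac{1}{T}\, T a_i = a_i.$$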

27 Fixed Effect Estimator
Now, take the difference between the two equations:
$$y_{it} = \beta_1 x_{it} + a_i + u_{it}, \quad t = 1, 2, \ldots, T,$$
$$\bar{y}_i = \beta_1 \bar{x}_i + a_i + \bar{u}_i.$$
Then, what we have is
$$y_{it} - \bar{y}_i = \beta_1 (x_{it} - \bar{x}_i) + (u_{it} - \bar{u}_i), \quad t = 1, 2, \ldots, T,$$
or
$$\ddot{y}_{it} = \beta_1 \ddot{x}_{it} + \ddot{u}_{it}, \quad t = 1, 2, \ldots, T.$$
We may apply the pooled OLS to the last equation.
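The demeaning step can be sketched in a few lines of Python. The function name `within_estimator` and the toy panel are hypothetical; the data are noise-free with large, individual-specific intercepts, so the sketch only shows that subtracting each individual's time average removes $a_i$ before the pooled regression.

```python
def within_estimator(groups):
    """Slope from pooled OLS (no intercept) on time-demeaned data.

    groups: dict mapping an individual id to a list of (x, y) pairs over time.
    """
    num = 0.0
    den = 0.0
    for obs in groups.values():
        T = len(obs)
        xbar = sum(x for x, _ in obs) / T
        ybar = sum(y for _, y in obs) / T
        for x, y in obs:
            num += (x - xbar) * (y - ybar)   # sum of x-ddot * y-ddot
            den += (x - xbar) ** 2           # sum of x-ddot squared
    return num / den


# Hypothetical panel: three individuals, three periods (assumed values).
beta1 = 1.5
a = {"A": 0.0, "B": 10.0, "C": 20.0}
x = {"A": [1.0, 2.0, 3.0], "B": [4.0, 6.0, 8.0], "C": [7.0, 7.5, 9.0]}
panel = {i: [(xv, beta1 * xv + a[i]) for xv in x[i]] for i in x}

fe_slope = within_estimator(panel)
print(fe_slope)
```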

28 Fixed Effect Estimator
The FE estimator uses information from within-group (i) variation:
$$\ddot{y}_{i1} = y_{i1} - \bar{y}_i$$
$$\ddot{y}_{i2} = y_{i2} - \bar{y}_i$$
$$\vdots$$
$$\ddot{y}_{iT} = y_{iT} - \bar{y}_i$$
For this reason, the FE estimator is also called the within estimator.
This can be readily extended to a multiple regression model:
$$\ddot{y}_{it} = \beta_1 \ddot{x}_{1it} + \beta_2 \ddot{x}_{2it} + \cdots + \beta_k \ddot{x}_{kit} + \ddot{u}_{it}$$

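29 Fixed Effect Estimator: FD vs. FE
If T = 2, the FD estimator and the FE estimator are identical:
$$\ddot{y}_{i2} \equiv y_{i2} - \bar{y}_i = y_{i2} - \left(\frac{y_{i1} + y_{i2}}{2}\right) = -\frac{1}{2} y_{i1} + \frac{1}{2} y_{i2} = \frac{1}{2} \Delta y_{i2}.$$
Therefore,
$$\ddot{y}_{i2} = \beta_1 \ddot{x}_{i2} + \ddot{u}_{i2} \iff \frac{1}{2} \Delta y_{i2} = \beta_1 \frac{1}{2} \Delta x_{i2} + \frac{1}{2} \Delta u_{i2} \iff \Delta y_{i2} = \beta_1 \Delta x_{i2} + \Delta u_{i2}.$$
However, they are different in a finite sample if T > 2.
Unless there is a unit root (or severe serial correlation) problem, you had better use the FE estimator.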
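The T = 2 equivalence can also be checked numerically. The sketch below uses arbitrary, hypothetical numbers for four individuals and the model without a period dummy, as on this slide; `fd_slope` and `fe_slope` are illustrative names.

```python
# Each tuple is (x_i1, y_i1, x_i2, y_i2) for one individual; values are arbitrary.
panel = [
    (1.0, 2.1, 2.0, 3.9),
    (0.5, 4.0, 1.5, 6.2),
    (3.0, 1.0, 2.0, 0.3),
    (2.0, 5.5, 4.5, 9.0),
]


def fd_slope(data):
    """OLS slope (no intercept) of the differenced y on the differenced x."""
    num = sum((x2 - x1) * (y2 - y1) for x1, y1, x2, y2 in data)
    den = sum((x2 - x1) ** 2 for x1, _, x2, _ in data)
    return num / den


def fe_slope(data):
    """OLS slope (no intercept) on data demeaned within each individual."""
    num = 0.0
    den = 0.0
    for x1, y1, x2, y2 in data:
        xbar = (x1 + x2) / 2.0
        ybar = (y1 + y2) / 2.0
        for x, y in ((x1, y1), (x2, y2)):
            num += (x - xbar) * (y - ybar)
            den += (x - xbar) ** 2
    return num / den


fd = fd_slope(panel)
fe = fe_slope(panel)
print(fd, fe)  # the two slopes agree up to floating-point rounding
```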

31 Random Effect Estimator
In the random effect model
$$y_{it} = \beta_0 + \beta_1 x_{it} + a_i + u_{it},$$
we assume that $Cov(x_{it}, a_i) = 0$.
Then we come back to the nice world where we don't need to cancel out $a_i$.
Just use the pooled OLS? No. There is a serial correlation problem.

32 Random Effect Estimator: Serial Correlation in the RE Model
We have two components in the error term:
$$v_{it} = a_i + u_{it}$$
Suppose that $u_{it}$ is totally innocuous again:
$$Cov(a_i, u_{it}) = Cov(u_{it}, u_{is}) = 0 \text{ for } t \neq s.$$
Now, we calculate $Corr(v_{it}, v_{is})$ and show that it is not zero:
$$Var(v_{it}) = Var(a_i + u_{it}) = \sigma_a^2 + \sigma_u^2$$
$$Cov(v_{it}, v_{is}) = E((a_i + u_{it})(a_i + u_{is})) = E(a_i^2 + a_i u_{is} + a_i u_{it} + u_{it} u_{is}) = E(a_i^2) = \sigma_a^2$$

33 Random Effect Estimator: Serial Correlation in the RE Model
Therefore,
$$Corr(v_{it}, v_{is}) = \frac{\sigma_a^2}{\sigma_a^2 + \sigma_u^2} \neq 0.$$
Any inference based on the pooled OLS would be incorrect.
However, we know how to fix this problem. Do GLS!
We want to transform the original model into
$$\tilde{y}_{it} = \beta_0 + \beta_1 \tilde{x}_{it} + \tilde{v}_{it},$$
where $\tilde{v}_{it}$ no longer has serial correlation.
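The correlation formula can be checked with a small simulation. This is only a sketch under added assumptions: both error components are drawn from normal distributions, and the standard deviations, sample size, and seed are hypothetical choices that do not come from the slides.

```python
import random


def simulate_error_corr(sigma_a, sigma_u, n=20000, seed=0):
    """Sample correlation between v_i1 = a_i + u_i1 and v_i2 = a_i + u_i2."""
    rng = random.Random(seed)
    v1, v2 = [], []
    for _ in range(n):
        a = rng.gauss(0.0, sigma_a)               # shared individual effect
        v1.append(a + rng.gauss(0.0, sigma_u))    # composite error in period 1
        v2.append(a + rng.gauss(0.0, sigma_u))    # composite error in period 2
    m1 = sum(v1) / n
    m2 = sum(v2) / n
    cov = sum((p - m1) * (q - m2) for p, q in zip(v1, v2)) / n
    sd1 = (sum((p - m1) ** 2 for p in v1) / n) ** 0.5
    sd2 = (sum((q - m2) ** 2 for q in v2) / n) ** 0.5
    return cov / (sd1 * sd2)


sigma_a, sigma_u = 1.0, 2.0   # assumed values
theory = sigma_a ** 2 / (sigma_a ** 2 + sigma_u ** 2)
simulated = simulate_error_corr(sigma_a, sigma_u)
print(theory, simulated)  # the simulated value should be close to the formula
```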

34 Random Effect Estimator
With AR(1) serial correlation, we multiplied by $\rho$ and took a difference.
In this case, we multiply the time average by $\lambda$ and take a difference, where
$$\lambda = 1 - \left[\frac{\sigma_u^2}{\sigma_u^2 + T \sigma_a^2}\right]^{1/2}$$
$$y_{it} - \lambda \bar{y}_i = \beta_0 (1 - \lambda) + \beta_1 (x_{it} - \lambda \bar{x}_i) + v_{it} - \lambda \bar{v}_i$$
We can show that $\tilde{v}_{it}$ ($= v_{it} - \lambda \bar{v}_i$) is not serially correlated.
$\lambda$ must be estimated by $\hat{\lambda}$.
This specific GLS estimator is called the Random Effect (RE) estimator.

35 Random Effect Estimator
The RE estimator is something between the pooled OLS and the FE estimator.
Note that the equation
$$y_{it} - \lambda \bar{y}_i = \beta_0 (1 - \lambda) + \beta_1 (x_{it} - \lambda \bar{x}_i) + v_{it} - \lambda \bar{v}_i$$
becomes the pooled OLS when $\lambda = 0$ and the FE estimator when $\lambda = 1$.
$\lambda$ is always between 0 and 1 in the RE model.
As $T \to \infty$, the FE and RE estimators are equivalent since $\lambda \to 1$.
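The role of $\lambda$ can be made concrete with a short sketch. The function `re_lambda` follows the formula for $\lambda$ given in the slides; `transformed_cov` evaluates the covariance between two quasi-demeaned errors of the same individual in different periods, obtained by expanding $Cov(v_{it} - \lambda \bar{v}_i, v_{is} - \lambda \bar{v}_i)$ under the assumptions stated for $a_i$ and $u_{it}$. The function names and the variance values are hypothetical.

```python
import math


def re_lambda(sigma_u2, sigma_a2, T):
    """Quasi-demeaning weight: 1 - sqrt(sigma_u^2 / (sigma_u^2 + T * sigma_a^2))."""
    return 1.0 - math.sqrt(sigma_u2 / (sigma_u2 + T * sigma_a2))


def transformed_cov(sigma_u2, sigma_a2, T, lam):
    """Cov(v_it - lam * vbar_i, v_is - lam * vbar_i) for t != s.

    Uses v_it = a_i + u_it with a_i and u_it mutually uncorrelated and
    u_it serially uncorrelated, so that
    cov = (1 - lam)^2 * sigma_a^2 - (2 * lam - lam^2) * sigma_u^2 / T.
    """
    return (1.0 - lam) ** 2 * sigma_a2 - (2.0 * lam - lam ** 2) * sigma_u2 / T


sigma_u2, sigma_a2, T = 4.0, 1.0, 5   # assumed values
lam = re_lambda(sigma_u2, sigma_a2, T)

cov_re = transformed_cov(sigma_u2, sigma_a2, T, lam)      # RE weight: no serial correlation
cov_pooled = transformed_cov(sigma_u2, sigma_a2, T, 0.0)  # lam = 0 is pooled OLS
lam_large_T = re_lambda(sigma_u2, sigma_a2, 10 ** 8)      # approaches 1 as T grows

print(lam, cov_re, cov_pooled, lam_large_T)
```

Setting `lam` to 0 leaves the covariance at `sigma_a2`, which is the serial correlation that makes pooled OLS inference unreliable.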

36 Random Effect Estimator: RE vs. FE
If you believe that there is an obviously endogenous fixed factor $a_i$ in your model, you should use the FE estimator.
Otherwise, the RE estimator will tell you more: coefficients on time-invariant regressors, efficiency, etc.
Keep in mind that the RE estimator is not even consistent if $Cov(x_{it}, a_i) \neq 0$.
We can test whether $Cov(x_{it}, a_i) = 0$ or not.

37 Random Effect Estimator: Hausman Test
The idea of the Hausman test is simple. The hypotheses are
$$H_0: Cov(x_{it}, a_i) = 0$$
$$H_1: Cov(x_{it}, a_i) \neq 0$$
Under $H_0$, both RE and FE are consistent:
$$\hat{\beta}_{RE} \xrightarrow{p} \beta, \quad \hat{\beta}_{FE} \xrightarrow{p} \beta.$$
Thus, we can expect that $\hat{\beta}_{RE} \approx \hat{\beta}_{FE}$.
However, under $H_1$, only $\hat{\beta}_{FE}$ is consistent.
Therefore, we reject $H_0$ if the difference between $\hat{\beta}_{RE}$ and $\hat{\beta}_{FE}$ is large enough.
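The slide gives only the idea of the test. As a hedged sketch, the snippet below uses the standard single-coefficient form of the statistic, $H = (\hat{\beta}_{FE} - \hat{\beta}_{RE})^2 / (Var(\hat{\beta}_{FE}) - Var(\hat{\beta}_{RE}))$, compared with a chi-squared distribution with one degree of freedom. This formula, the function name `hausman_scalar`, and every input number are assumptions for illustration and do not come from the slides.

```python
import math


def hausman_scalar(b_fe, b_re, var_fe, var_re):
    """Single-coefficient Hausman statistic and its chi-squared(1) p-value.

    Under H0 the RE estimator is efficient, so var_fe - var_re should be
    positive; the function refuses to continue otherwise.
    """
    diff_var = var_fe - var_re
    if diff_var <= 0:
        raise ValueError("var_fe must exceed var_re for this form of the test")
    h = (b_fe - b_re) ** 2 / diff_var
    # For one degree of freedom, P(chi2 > h) = P(|Z| > sqrt(h)) = erfc(sqrt(h / 2)).
    p_value = math.erfc(math.sqrt(h / 2.0))
    return h, p_value


# Hypothetical estimates and variances (not taken from any data set).
h_stat, p_val = hausman_scalar(b_fe=1.0, b_re=0.5, var_fe=0.05, var_re=0.01)
print(h_stat, p_val)  # a small p-value would lead us to reject H0 and prefer FE
```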

39 Empirical Application: Smoking on Birth Outcomes
"Infants born to women who smoke during pregnancy have a lower average birthweight... Low birthweight is associated with increased risk for neonatal, perinatal, and infant morbidity and mortality."
(Women and Smoking: A Report of the Surgeon General, 2001, as quoted in Abrevaya (2006))

40 Empirical Application: Smoking on Birth Outcomes
The direct medical costs: according to the estimates of Lewit et al. (1995), low-birthweight (LBW) infants (less than 10% of births) account for more than 1/3 of health care costs during the first year of life.
The long-term costs: Hack et al. (1995) find that LBW babies have developmental problems in cognition, attention, and neuromotor functioning that persist until adolescence.
(Abrevaya (2006))

41 Empirical Application: Smoking on Birth Outcomes: How to Estimate
The OLS estimates would be biased in the negative direction due to endogeneity.
IV estimation?
Comparison between OLS and IV estimates from Abrevaya (2006)

42 Empirical Application: Smoking on Birth Outcomes
The fixed-effect (FE) estimation can be used if panel data are available.
Abrevaya (2006) constructed a pseudo panel data set and showed that the FE estimate is smaller than that of the OLS:
$$y_{ib} = x_{ib} \beta + \gamma s_{ib} + c_i + u_{ib},$$
where i is the mother's id and b is the birth order of the baby from mother i.
The estimation results for $\gamma$ by OLS and FE are (3.20) (4.75), respectively.

47 Concluding Remarks
Pooled cross sections are very similar to a single cross section, but observations across different time points help evaluate the correct policy effect.
Extra information contained in panel data enables us to control for the individual fixed effect with the FD and FE estimators.
If the fixed effect is not correlated with the regressors, we can apply the RE estimator, which is a GLS estimator.
Panel data are not restricted to the individual-time structure.

48 Stata Commands
Load the data set filename.dta.
First, we need to set an id variable and a time variable. Check the relevant variable names.
    xtset id time
Now type
    xtsum
The command for the FE estimator is
    xtreg dep x1 x2 x3 ..., fe
The command for the RE estimator is
    xtreg dep x1 x2 x3 ..., re
