Topic 10: Panel Data Analysis
Advanced Econometrics (I)
Dong Chen
School of Economics, Peking University

1 Introduction

Panel data combine the features of cross section data and time series. A panel data set usually has the following format.

    id   time   y      x
    1    1998   3.56   543
    1    1999   3.98   438
    1    2000   3.20   648
    2    1998   2.48   393
    2    1999   2.69   974
    2    2000   3.01   753
    ...  ...    ...    ...

The rich information contained in a panel data set allows us to answer some questions that cannot be answered well using a cross section or time series data set alone. For example, in the study of FDI's impact on domestic firms' productivity using industry-level cross section data, one faces the problem of separating out the potential effect of self-selection. Suppose one estimates a cross section model that regresses domestic firms' average productivity in an industry on the level of FDI presence, along with a set of industry-specific controls, using industry-level data:

\[ PROD_i = \beta_1 + \beta_2 FDI_i + \beta_3 CONC_i + \beta_4 KLRATIO_i + \beta_5 SOE_i + \varepsilon_i, \tag{1} \]

where $PROD$ is industry average productivity, $FDI$ is the level of FDI presence (measured by capital or employment), $CONC$ is a measure of market concentration, $KLRATIO$ is the capital-labor ratio, and $SOE$ is a measure of state-owned enterprises' weight in an industry. A positive coefficient estimate for FDI presence in this case can bear two interpretations. One is that foreign firms' presence raises domestic firms' productivity, which suggests a positive spillover effect of FDI. However, it may also reflect the fact that multinational firms are more likely to enter industries in which domestic firms have higher productivity in the first place. Without information on the time dimension, it is hard to distinguish between these two effects. Panel data, which follow the same industries over time, allow one to make inferences in cases like this.

An important advantage of panel data is that they allow us to control for unobserved cross-unit heterogeneity that does not vary over time. For
example, demographics and civic culture are thought to change very slowly, especially compared to short-term fluctuations in the economy or changes in the political climate. Consider the problem of estimating the effect of imprisonment on crime. The simple correlation between the per capita prison population and property crimes per capita is positive. Controlling for demographic and short-run economic factors, the relationship between imprisonment and property crime remains positive. Does this mean imprisonment has no effect on curbing crime, or is it simply because we have omitted some unmeasured factors, such as culture? Panel data are helpful in answering questions like this.

In this chapter, we discuss some widely used models of panel data. Before turning to those models, we first examine a different, though related, type of data, namely, independently pooled cross sections.

2 Independently Pooled Cross Sections

Independently pooled cross section data are composed of observations that are sampled randomly from a large population at different points in time. They differ from panel data in that they contain different samples of units in each period instead of following the same units across time. As a result, independently pooled cross section data sets are sometimes also called pseudo-panels.

Models using independently pooled cross section data can be estimated by OLS. Because of the increased sample size, one can obtain more precise parameter estimates and more powerful hypothesis tests. More importantly, adding a time dimension to the data set can sometimes allow for correct inferences that are otherwise impossible.

Example 1: Immigration is an important part of Canada's public policy. The Canadian government is concerned about how immigrants are assimilated into Canadian society. One important measure of assimilation is to compare immigrants' employment income with that of native-born Canadians.
A possible way of measuring the assimilation process is to regress people's job earnings on an immigrant dummy variable, a variable indicating the number of years since an immigrant landed in Canada, and a set of individual specific characteristics such as level of education, age, work experience, etc.:

\[ EARN_i = \beta_1 + \beta_2 IMM_i + \beta_3 YEARIMM_i + \beta_4 YEARIMM_i^2 + \beta_5 AGE_i + \beta_6 SCHOOL_i + \varepsilon_i, \tag{2} \]

where $EARN$ is an individual's job earnings, $IMM$ is a dummy variable indicating whether an individual is an immigrant, $YEARIMM$ is the number of years since immigration, $AGE$ measures an individual's age, and $SCHOOL$ is years of schooling. Suppose this model is estimated using cross section data, for example, a sample from the census of a certain year. To make correct inferences, an important assumption is that the unobserved qualities of immigrants who landed in Canada in different years are the same. However, this assumption may well be violated because of the dramatic shift in immigrants' source countries during the past several decades. Therefore, the observed positive correlation between immigrants' earnings and the length of time that they have spent in Canada may not reflect real assimilation, but is instead, at least partly, due to
the changes in the unobserved qualities among different immigrant cohorts. If, however, the estimation is based on pooled data from multiple censuses, then this cohort effect can be controlled for.

Several issues need to be considered when using pooled cross section data. First, time period dummy variables are usually included in the model to allow the intercept to differ across periods. If the effect of an explanatory variable changes over time, then we should interact the time period dummies with that explanatory variable. Note that if we interact all explanatory variables with the time period dummies, then it is equivalent to running separate regressions for each time period. This calls for testing for structural changes in the model across time, for example, using the Chow test discussed earlier.

3 Panel Data Models

The basic framework of panel data analysis is summarized by the following model:

\[ y_{it} = x_{it}'\beta + z_i'\alpha + \varepsilon_{it}, \quad i = 1, \ldots, n; \; t = 1, \ldots, T. \tag{3} \]

There are $K$ regressors in $x_{it}$, not including a constant term. The term $z_i'\alpha$ measures the individual or heterogeneous effect, where $z_i$ contains a constant term and a set of individual or group specific variables that are constant over time $t$. Note that if $z_i$ is observed for all individuals, then this model is simply a regular linear regression model that can be estimated by OLS. Depending on the assumptions on $z_i'\alpha$, we obtain different models.

Pooled Regression. If $z_i$ contains only a constant term (i.e., there is no unobserved individual or group specific heterogeneity), then OLS gives consistent and efficient estimates of the common intercept, $\alpha$, and the slope vector $\beta$.

Fixed Effects. If $z_i$ is unobserved, at least partially, but correlated with $x_{it}$, then the OLS estimator of $\beta$ is biased and inconsistent because of the omitted variable problem. In this case, the model becomes

\[ y_{it} = x_{it}'\beta + \alpha_i + \varepsilon_{it}, \tag{4} \]

where $\alpha_i = z_i'\alpha$ embodies all the unobservable individual or group specific effects.
Random Effects. If the unobserved individual heterogeneity can be assumed to be uncorrelated with $x_{it}$, then the model can be rewritten as

\[
\begin{aligned}
y_{it} &= x_{it}'\beta + E(z_i'\alpha) + \{ z_i'\alpha - E(z_i'\alpha) \} + \varepsilon_{it} \\
       &= x_{it}'\beta + \alpha + u_i + \varepsilon_{it},
\end{aligned}
\tag{5}
\]

where $u_i$ is a group specific random term.

4 Fixed Effects

4.1 Estimation

Note that in equation (4), the $\alpha_i$ can be treated as individual specific intercept terms, and they can be estimated by including a set of dummy variables in the model. Thus, the fixed effects model can be written in compact form as

\[ y = D\alpha + X\beta + \varepsilon, \tag{6} \]
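where

\[
D = \begin{bmatrix} i & 0 & \cdots & 0 \\ 0 & i & \cdots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \cdots & i \end{bmatrix}_{nT \times n},
\qquad
\alpha = \begin{bmatrix} \alpha_1 \\ \alpha_2 \\ \vdots \\ \alpha_n \end{bmatrix}_{n \times 1},
\]

and $i$ is a $T \times 1$ vector of ones. This model is also called the least squares dummy variable (LSDV) model.

Estimating (6) by OLS involves inverting an $(n + K) \times (n + K)$ matrix. If $n$ is large, which is usually the case, this is likely to exceed the storage capacity of computers. There is an alternative way to proceed that only requires inverting a $K \times K$ matrix. This is achieved by using the results of partitioned regression to first estimate $\beta$ alone from model (6). The normal equations are

\[
\begin{matrix} \text{(i)} \\ \text{(ii)} \end{matrix}
\quad
\begin{bmatrix} D'D & D'X \\ X'D & X'X \end{bmatrix}
\begin{bmatrix} a \\ b \end{bmatrix}
=
\begin{bmatrix} D'y \\ X'y \end{bmatrix}.
\tag{7}
\]

We can first solve for $a$ from (i) in (7).[1]

\[ a = (D'D)^{-1}D'y - (D'D)^{-1}D'Xb \tag{8} \]
\[ \phantom{a} = (D'D)^{-1}D'(y - Xb). \tag{9} \]

Substituting (9) into (ii) in (7), we have

\[ X'D(D'D)^{-1}D'y - X'D(D'D)^{-1}D'Xb + X'Xb = X'y. \]

Solving for $b$ yields

\[
\begin{aligned}
b &= \left[ X'\left( I - D(D'D)^{-1}D' \right) X \right]^{-1} \left[ X'\left( I - D(D'D)^{-1}D' \right) y \right] \\
  &= [X'M_D X]^{-1}[X'M_D y],
\end{aligned}
\tag{10}
\]

where $M_D = I - D(D'D)^{-1}D'$. Since $M_D$ is a symmetric idempotent matrix, we can rewrite (10) as

\[
\begin{aligned}
b &= [X'M_D'M_D X]^{-1}[X'M_D'M_D y] \\
  &= (X_*'X_*)^{-1}X_*'y_*,
\end{aligned}
\tag{11}
\]

where $X_* = M_D X$ and $y_* = M_D y$.

[1] The solution (8) implies an important result: if $D'X = 0$, then $a = (D'D)^{-1}D'y$, which is just the OLS estimator from regressing $y$ on $D$.

Note that the columns of the matrix $D$ are orthogonal, so

\[
M_D = \begin{bmatrix} M^0 & 0 & \cdots & 0 \\ 0 & M^0 & \cdots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \cdots & M^0 \end{bmatrix},
\]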
where $M^0 = I_T - \frac{1}{T} i i'$. Recall that the matrix $M^0$ creates deviations from the mean when postmultiplied by any $T \times 1$ vector $z_i$ (see the lecture notes on the derivation of $R^2$). That is, $M^0 z_i = z_i - \bar{z}_i\, i$. Therefore, the least squares regression of $M_D y$ on $M_D X$ is equivalent to a regression of $(y_{it} - \bar{y}_{i.})$ on $(x_{it} - \bar{x}_{i.})$, where $\bar{y}_{i.}$ and $\bar{x}_{i.}$ are the scalar and the $K \times 1$ vector of means of $y_{it}$ and $x_{it}$ over the $T$ observations for group $i$. That is,

\[ y_{it} - \bar{y}_{i.} = (x_{it} - \bar{x}_{i.})'\beta + (\varepsilon_{it} - \bar{\varepsilon}_{i.}). \tag{12} \]

Hence, the fixed effects estimator $b$ can also be written as

\[
b = \left[ \sum_{i=1}^{n}\sum_{t=1}^{T} (x_{it} - \bar{x}_{i.})(x_{it} - \bar{x}_{i.})' \right]^{-1}
    \left[ \sum_{i=1}^{n}\sum_{t=1}^{T} (x_{it} - \bar{x}_{i.})(y_{it} - \bar{y}_{i.}) \right].
\]

With the estimate for $\beta$, the estimate for $\alpha$ can be obtained from the other normal equation in the partitioned regression,

\[ a = (D'D)^{-1}D'(y - Xb), \tag{13} \]

or

\[ a_i = \bar{y}_{i.} - \bar{x}_{i.}'b. \tag{14} \]

Remark 1: In fixed effects models, explanatory variables that are constant over time cannot be included, because in this case $x_{it} - \bar{x}_{i.} = 0$ for all $i$ and $t$. Also, when a full set of time period dummy variables (or a linear time trend) is included, explanatory variables whose change across time is constant, e.g. age, cannot be included, because such a variable is an exact linear combination of the individual effects and the time dummies (or the trend). Although time-invariant variables cannot be included by themselves in the fixed effects model, one can interact them with variables that change over time, for example, with time period dummy variables. Doing so yields estimates of how the partial effect of that variable changes over time.

Remark 2: A panel data set with missing values for some time periods is called an unbalanced panel. Generally, we can proceed as usual using the available data. Note that units with only one period of data play no role in the estimation and are dropped, because for them $y_{it} - \bar{y}_{i.} = 0$ and $x_{it} - \bar{x}_{i.} = 0$.

4.2 Properties of the Fixed Effects Estimator

If we assume random sampling on the cross section dimension and strict exogeneity on the time series dimension (conditional on the unobserved effects),

\[ E(\varepsilon_{it} \mid X_i, \alpha_i) = 0, \tag{15} \]
then the fixed effects estimators of $\alpha$ and $\beta$ are unbiased. To see the unbiasedness of $b$, note that $M_D D = 0$, so

\[
\begin{aligned}
E(b) &= E\left\{ [X'M_D X]^{-1}[X'M_D y] \right\} \\
     &= E\left\{ [X'M_D X]^{-1}[X'M_D (D\alpha + X\beta + \varepsilon)] \right\} \\
     &= E\left\{ 0 + \beta + [X'M_D X]^{-1}X'M_D \varepsilon \right\} \\
     &= \beta.
\end{aligned}
\]

The estimator of the covariance matrix for $b$ is

\[ \text{Est.Var}(b) = s^2 (X'M_D X)^{-1} \tag{16} \]
\[ \phantom{\text{Est.Var}(b)} = s^2 \left[ \sum_{i=1}^{n}\sum_{t=1}^{T} (x_{it} - \bar{x}_{i.})(x_{it} - \bar{x}_{i.})' \right]^{-1}, \tag{17} \]

where

\[ s^2 = \frac{\sum_{i=1}^{n}\sum_{t=1}^{T} e_{it}^2}{nT - n - K}. \tag{18} \]

The $it$th residual, $e_{it}$, is defined as

\[
\begin{aligned}
e_{it} &= y_{it} - x_{it}'b - a_i \\
       &= y_{it} - x_{it}'b - \bar{y}_{i.} + \bar{x}_{i.}'b \\
       &= (y_{it} - \bar{y}_{i.}) - (x_{it} - \bar{x}_{i.})'b.
\end{aligned}
\tag{19}
\]

For the fixed effects estimator to be BLUE, we need to further assume homoskedasticity and no autocorrelation. That is, for each $t$,

\[ \text{Var}(\varepsilon_{it} \mid X_i, \alpha_i) = \text{Var}(\varepsilon_{it}) = \sigma_\varepsilon^2, \tag{20} \]

and for all $t \neq s$,

\[ \text{Cov}(\varepsilon_{it}, \varepsilon_{is} \mid X_i, \alpha_i) = 0. \tag{21} \]

The fixed effects estimator of $\beta$ is consistent when either $n$ or $T$ or both tend to infinity. However, the estimator of $\alpha$ is consistent only if $T \to \infty$.

STATA Tips

To estimate fixed effects models in STATA, we first need to declare the data set as panel data by using the tsset command.

    tsset id_var date_var, option

The usual time series operators (lag, difference, etc.) can then be applied to the panel data. For example, to create the first difference of the variable profit across years for each firm, you can type

    tsset firm year, yearly
    gen dprofit = d.profit

A fixed effects model can then be estimated by using the xtreg command with the fe option.

    xtreg dep_var var_list, fe

Note that STATA's panel commands require data arranged in long form rather than wide form. To transform a wide-form data set to long form, use the reshape command. Check the help file for more details.
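The equivalence between the within regression (12) and the LSDV normal equations (7) can be checked numerically. The following is only a minimal Python sketch under simplifying assumptions: a single regressor ($K = 1$), a balanced panel, and the six illustrative rows from the format table in the introduction used as toy data. The function names `group_means` and `within_estimator` are hypothetical helpers, not part of any library.

```python
# Toy panel taken from the format table in the introduction: (id, year, y, x).
panel = [
    (1, 1998, 3.56, 543.0), (1, 1999, 3.98, 438.0), (1, 2000, 3.20, 648.0),
    (2, 1998, 2.48, 393.0), (2, 1999, 2.69, 974.0), (2, 2000, 3.01, 753.0),
]

def group_means(rows):
    """Return {id: (mean of y, mean of x)} over each unit's observations."""
    sums = {}
    for unit, _, y, x in rows:
        sy, sx, count = sums.get(unit, (0.0, 0.0, 0))
        sums[unit] = (sy + y, sx + x, count + 1)
    return {u: (sy / c, sx / c) for u, (sy, sx, c) in sums.items()}

def within_estimator(rows):
    """Fixed effects slope b and intercepts a_i for a single regressor (K = 1)."""
    means = group_means(rows)
    sxx = sxy = 0.0
    for unit, _, y, x in rows:
        ybar, xbar = means[unit]
        sxx += (x - xbar) ** 2            # within sum of squares
        sxy += (x - xbar) * (y - ybar)    # within cross product
    b = sxy / sxx
    # Equation (14): a_i = ybar_i - xbar_i * b
    a = {u: ybar - xbar * b for u, (ybar, xbar) in means.items()}
    return b, a

b, a = within_estimator(panel)
# Residuals e_it = y_it - a_i - b * x_it, as in equation (19).
residuals = [(unit, x, y - a[unit] - b * x) for unit, _, y, x in panel]
print("b =", b)
print("a =", a)
```

If the two approaches agree, the residuals must satisfy the LSDV normal equations: they sum to zero within each unit (orthogonality to the columns of $D$) and are orthogonal to $x$.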
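5 Random Effects

5.1 Assumptions

If we assume $\alpha_i$ is random and uncorrelated with $X$, then we can make more efficient use of the data by using the random effects model. Consider a reformulation of the model

\[ y_{it} = x_{it}'\beta + (\alpha + u_i) + \varepsilon_{it}. \tag{22} \]

In this case we have $K$ regressors and a constant term $\alpha$, which is the mean of the unobserved heterogeneity, $E(z_i'\alpha)$. The component $u_i$ is the random heterogeneity specific to the $i$th unit and is constant over time. It is further assumed that

\[
\begin{aligned}
E(\varepsilon_{it} \mid X) &= E(u_i \mid X) = 0, \\
E(\varepsilon_{it}^2 \mid X) &= \sigma_\varepsilon^2, \\
E(u_i^2 \mid X) &= \sigma_u^2, \\
E(\varepsilon_{it} u_j \mid X) &= 0 \quad \text{for all } i, t, j, \\
E(\varepsilon_{it} \varepsilon_{js} \mid X) &= 0 \quad \text{if } t \neq s \text{ or } i \neq j, \\
E(u_i u_j \mid X) &= 0 \quad \text{if } i \neq j.
\end{aligned}
\]

Let $\eta_{it} = u_i + \varepsilon_{it}$, which is the composite error term. It follows that

\[
\begin{aligned}
E(\eta_{it} \mid X) &= 0, \\
E(\eta_{it}^2 \mid X) &= \sigma_u^2 + \sigma_\varepsilon^2, \\
E(\eta_{it} \eta_{is} \mid X) &= \sigma_u^2 \quad \text{if } t \neq s, \\
E(\eta_{it} \eta_{js} \mid X) &= 0 \quad \text{for all } t \text{ and } s \text{ if } i \neq j.
\end{aligned}
\]

Denote $\eta_i = [\eta_{i1}, \eta_{i2}, \ldots, \eta_{iT}]'$ and let $E(\eta_i \eta_i' \mid X) = \Sigma$. Then

\[
\Sigma = \begin{bmatrix}
\sigma_u^2 + \sigma_\varepsilon^2 & \sigma_u^2 & \cdots & \sigma_u^2 \\
\sigma_u^2 & \sigma_u^2 + \sigma_\varepsilon^2 & \cdots & \sigma_u^2 \\
\vdots & \vdots & \ddots & \vdots \\
\sigma_u^2 & \sigma_u^2 & \cdots & \sigma_u^2 + \sigma_\varepsilon^2
\end{bmatrix}
\tag{23}
\]
\[ \phantom{\Sigma} = \sigma_\varepsilon^2 I_T + \sigma_u^2 i_T i_T', \tag{24} \]

where $i_T$ is a $T \times 1$ vector of 1s. Since observations $i$ and $j$ are independent, the disturbance covariance matrix for the full $nT$ observations is

\[
\Omega = \begin{bmatrix}
\Sigma & 0 & \cdots & 0 \\
0 & \Sigma & \cdots & 0 \\
\vdots & \vdots & \ddots & \vdots \\
0 & 0 & \cdots & \Sigma
\end{bmatrix}
= I_n \otimes \Sigma.
\tag{25}
\]

5.2 GLS Estimator

Given the error structure of the random effects model, OLS applied to model (22) yields a consistent estimator of $\beta$, but it is not efficient. To obtain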
an efficient estimator, we use the GLS method. The GLS estimator of the slope parameters is

\[
\begin{aligned}
\hat{\beta} &= \left( X'\Omega^{-1}X \right)^{-1} X'\Omega^{-1}y \\
            &= \left( \sum_{i=1}^{n} X_i'\Sigma^{-1}X_i \right)^{-1} \left( \sum_{i=1}^{n} X_i'\Sigma^{-1}y_i \right).
\end{aligned}
\tag{26}
\]

The GLS method is equivalent to OLS on the transformed model

\[ y_{it} - \theta\bar{y}_{i.} = (1 - \theta)\alpha + (x_{it} - \theta\bar{x}_{i.})'\beta + (\eta_{it} - \theta\bar{\eta}_{i.}), \tag{27} \]

where

\[ \theta = 1 - \frac{\sigma_\varepsilon}{\sqrt{\sigma_\varepsilon^2 + T\sigma_u^2}}. \tag{28} \]

The transformed variables are known as quasi-demeaned data because they are formed by subtracting only a fraction $\theta$ of the group averages. Note the similarity of this procedure to the computation in the fixed effects model, which uses $\theta = 1$. Unlike in fixed effects models, it is possible to include explanatory variables that are constant over time in the random effects model.

5.3 Feasible GLS

Since the variance components, $\sigma_u^2$ and $\sigma_\varepsilon^2$, are usually unknown, we can use a two-step method to obtain the FGLS estimator. In the first step, we estimate the variance components using some consistent estimators, and in the second step, we substitute those values into the GLS estimator. Specifically, consider

\[ y_{it} = x_{it}'\beta + \alpha + \varepsilon_{it} + u_i \tag{29} \]

and

\[ \bar{y}_{i.} = \bar{x}_{i.}'\beta + \alpha + \bar{\varepsilon}_{i.} + u_i. \tag{30} \]

Taking the difference yields

\[ y_{it} - \bar{y}_{i.} = (x_{it} - \bar{x}_{i.})'\beta + (\varepsilon_{it} - \bar{\varepsilon}_{i.}). \tag{31} \]

The OLS estimator of $\beta$ from model (31) is just the LSDV estimator, which is unbiased and consistent. For each group $i$, we can estimate $\sigma_\varepsilon^2$ by

\[ s_e^2(i) = \frac{\sum_{t=1}^{T} (e_{it} - \bar{e}_{i.})^2}{T - K - 1}, \tag{32} \]

where $e_{it}$ is given in (19). There are $n$ such estimators, so we can take the average and obtain

\[ \bar{s}_e^2 = \frac{1}{n}\sum_{i=1}^{n} s_e^2(i) = \frac{\sum_{i=1}^{n}\sum_{t=1}^{T} (e_{it} - \bar{e}_{i.})^2}{nT - nK - n}. \]

However, since $\beta$ is estimated only once rather than separately for each group, the above expression overcorrects for the degrees of freedom. It can be shown that an unbiased estimator for $\sigma_\varepsilon^2$ is

\[ s_{LSDV}^2 = \frac{\sum_{i=1}^{n}\sum_{t=1}^{T} (e_{it} - \bar{e}_{i.})^2}{nT - n - K}. \tag{33} \]
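Once values for the variance components are available, the weight $\theta$ in (28) and the quasi-demeaning in (27) are mechanical. The following minimal Python sketch illustrates this for one unit's series; the function names `theta` and `quasi_demean` and the numerical variance components are hypothetical, chosen only so that the arithmetic is easy to follow.

```python
import math

def theta(sigma2_eps, sigma2_u, T):
    """Weight from equation (28): 1 - sigma_eps / sqrt(sigma_eps^2 + T * sigma_u^2)."""
    return 1.0 - math.sqrt(sigma2_eps) / math.sqrt(sigma2_eps + T * sigma2_u)

def quasi_demean(values, th):
    """Subtract the fraction th of the unit's time average from each observation."""
    mean = sum(values) / len(values)
    return [v - th * mean for v in values]

# Hypothetical variance components: sigma_eps^2 = 1, sigma_u^2 = 1, T = 3.
th = theta(sigma2_eps=1.0, sigma2_u=1.0, T=3)
print(th)                                  # 0.5
print(quasi_demean([2.0, 4.0, 6.0], th))   # [0.0, 2.0, 4.0]
print(theta(1.0, 0.0, 3))                  # 0.0: no individual effect, pooled OLS
```

The two limiting cases match the text: with $\sigma_u^2 = 0$ the weight is $\theta = 0$ and no demeaning takes place, while $\theta$ approaches 1, the fixed effects transformation, as $T\sigma_u^2$ grows relative to $\sigma_\varepsilon^2$.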
Note that estimating (29) by pooled OLS gives consistent estimators of $\alpha$ and $\beta$. Hence, a consistent estimator of $E(\eta_{it}^2)$ is

\[ s_{Pooled}^2 = \frac{e'e}{nT - K - 1}, \tag{34} \]

where $e$ is the vector of pooled OLS residuals. That is,

\[ \text{plim}\; s_{Pooled}^2 = \sigma_u^2 + \sigma_\varepsilon^2. \]

Therefore, we can estimate $\sigma_u^2$ by

\[ \hat{\sigma}_u^2 = s_{Pooled}^2 - s_{LSDV}^2. \tag{35} \]

Plugging $\hat{\sigma}_\varepsilon^2 = s_{LSDV}^2$ and $\hat{\sigma}_u^2$ into (28), we obtain an estimator of $\theta$. When the sample size is large (in the sense that either $n$ or $T$ or both $\to \infty$), the FGLS estimator is asymptotically as efficient as the true GLS estimator. Even for moderate sample sizes, FGLS is still more efficient than the fixed effects estimator.

6 Testing for Random Effects

If the regressors are correlated with the random effects $\alpha_i$, then the GLS estimator of $\beta$ is inconsistent. If that is the case, then we should use the fixed effects model, which yields consistent estimators whether or not such correlation is present. Otherwise, random effects models are more efficient. It is possible to test for such orthogonality by using Hausman's specification test.

\[ H_0: \text{Cov}(x_{itj}, \alpha_i) = 0 \quad \text{vs.} \quad H_1: \text{Cov}(x_{itj}, \alpha_i) \neq 0. \]

The test statistic is

\[ W = (b - \hat{\beta})'\Psi^{-1}(b - \hat{\beta}) \overset{a}{\sim} \chi_K^2, \tag{36} \]

where $b$ is the fixed effects estimator, $\hat{\beta}$ is the random effects estimator, and $\Psi = \text{Est.Var}(b) - \text{Est.Var}(\hat{\beta})$.

STATA Tips

To estimate the random effects model in STATA, use the xtreg command with the re option.

    xtreg dep_var var_list, re

Hausman's test can be carried out using the following commands.

    quiet xtreg dep_var var_list, fe
    estimates store fixed
    quiet xtreg dep_var var_list, re
    hausman fixed
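To make the arithmetic behind (36) concrete, the following Python sketch computes the Hausman statistic in the scalar case $K = 1$, where $\Psi$ is a single number. It is only an illustration: the function name `hausman_scalar` and the input estimates are hypothetical, and in practice the statistic would come from the STATA commands above.

```python
def hausman_scalar(b_fe, b_re, var_fe, var_re):
    """Hausman statistic (36) for K = 1: (b - beta_hat)^2 / (Var(b) - Var(beta_hat))."""
    psi = var_fe - var_re
    if psi <= 0:
        # In finite samples the estimated difference can fail to be positive;
        # the statistic is then not usable as a chi-square variate.
        raise ValueError("Est.Var(b) - Est.Var(beta_hat) must be positive")
    return (b_fe - b_re) ** 2 / psi

CHI2_1_CRIT_5PCT = 3.841  # 5% critical value of chi-square with 1 degree of freedom

# Hypothetical estimates, for illustration only.
W = hausman_scalar(b_fe=1.0, b_re=0.8, var_fe=0.05, var_re=0.01)
print(W, "reject H0" if W > CHI2_1_CRIT_5PCT else "do not reject H0")
```

With these made-up numbers the statistic is about 1, below the 5% critical value, so the random effects specification would not be rejected.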
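7 Comparison of OLS, Fixed Effects, and Random Effects

We can formulate the panel data regression model in three ways. First, the original model is

\[ y_{it} = x_{it}'\beta + \alpha + \varepsilon_{it}. \tag{37} \]

In terms of deviations from the group means,

\[ y_{it} - \bar{y}_{i.} = (x_{it} - \bar{x}_{i.})'\beta + (\varepsilon_{it} - \bar{\varepsilon}_{i.}), \tag{38} \]

and in terms of the group means,

\[ \bar{y}_{i.} = \bar{x}_{i.}'\beta + \alpha + \bar{\varepsilon}_{i.}. \tag{39} \]

In (37), the matrices of the sums of squares and cross products around the overall means $\bar{\bar{x}}$ and $\bar{\bar{y}}$ are

\[ S_{xx}^{total} = \sum_{i=1}^{n}\sum_{t=1}^{T} (x_{it} - \bar{\bar{x}})(x_{it} - \bar{\bar{x}})', \tag{40} \]
\[ S_{xy}^{total} = \sum_{i=1}^{n}\sum_{t=1}^{T} (x_{it} - \bar{\bar{x}})(y_{it} - \bar{\bar{y}}). \]

In (38), the matrices are around the group means and are given by

\[ S_{xx}^{within} = \sum_{i=1}^{n}\sum_{t=1}^{T} (x_{it} - \bar{x}_{i.})(x_{it} - \bar{x}_{i.})', \tag{41} \]
\[ S_{xy}^{within} = \sum_{i=1}^{n}\sum_{t=1}^{T} (x_{it} - \bar{x}_{i.})(y_{it} - \bar{y}_{i.}). \tag{42} \]

These measure the variation within groups. Finally, for (39), the moment matrices are the between-groups sums of squares and cross products,

\[ S_{xx}^{between} = \sum_{i=1}^{n} T (\bar{x}_{i.} - \bar{\bar{x}})(\bar{x}_{i.} - \bar{\bar{x}})', \tag{43} \]
\[ S_{xy}^{between} = \sum_{i=1}^{n} T (\bar{x}_{i.} - \bar{\bar{x}})(\bar{y}_{i.} - \bar{\bar{y}}). \tag{44} \]

It can be shown that

\[ S_{xx}^{total} = S_{xx}^{within} + S_{xx}^{between}, \tag{45} \]
\[ S_{xy}^{total} = S_{xy}^{within} + S_{xy}^{between}. \tag{46} \]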
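The decomposition in (45) follows from splitting each deviation from the overall mean into a within part and a between part. A short sketch of the step, writing $\bar{\bar{x}}$ for the overall mean of $x_{it}$ (notation assumed here), is:

```latex
\begin{aligned}
x_{it} - \bar{\bar{x}} &= (x_{it} - \bar{x}_{i.}) + (\bar{x}_{i.} - \bar{\bar{x}}), \\
\sum_{i=1}^{n}\sum_{t=1}^{T} (x_{it} - \bar{\bar{x}})(x_{it} - \bar{\bar{x}})'
  &= \sum_{i=1}^{n}\sum_{t=1}^{T} (x_{it} - \bar{x}_{i.})(x_{it} - \bar{x}_{i.})'
   + \sum_{i=1}^{n} T (\bar{x}_{i.} - \bar{\bar{x}})(\bar{x}_{i.} - \bar{\bar{x}})' \\
  &\quad + \sum_{i=1}^{n} \Big[ \sum_{t=1}^{T} (x_{it} - \bar{x}_{i.}) \Big] (\bar{x}_{i.} - \bar{\bar{x}})'
   + \sum_{i=1}^{n} (\bar{x}_{i.} - \bar{\bar{x}}) \Big[ \sum_{t=1}^{T} (x_{it} - \bar{x}_{i.}) \Big]'.
\end{aligned}
```

The two cross terms vanish because $\sum_{t=1}^{T} (x_{it} - \bar{x}_{i.}) = 0$ for every $i$, which leaves the within and between sums. The same argument, with $(y_{it} - \bar{\bar{y}})$ in place of the second factor, gives (46).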
As a result, the OLS estimator with pooled data (without exploiting the panel feature of the data set) is

\[ b^{total} = \left[ S_{xx}^{total} \right]^{-1} S_{xy}^{total} = \left[ S_{xx}^{within} + S_{xx}^{between} \right]^{-1} \left[ S_{xy}^{within} + S_{xy}^{between} \right]. \tag{47} \]

The within-group estimator, which is also the LSDV or the fixed effects estimator, is given by

\[ b^{within} = \left[ S_{xx}^{within} \right]^{-1} S_{xy}^{within}. \tag{48} \]

And finally the between-groups estimator is

\[ b^{between} = \left[ S_{xx}^{between} \right]^{-1} S_{xy}^{between}. \tag{49} \]

From (48) and (49), we have

\[ S_{xy}^{within} = S_{xx}^{within} b^{within}, \tag{50} \]
\[ S_{xy}^{between} = S_{xx}^{between} b^{between}. \tag{51} \]

Substituting (50) and (51) into (47), we have

\[ b^{total} = F^{within} b^{within} + F^{between} b^{between}, \tag{52} \]

where

\[ F^{within} = \left[ S_{xx}^{within} + S_{xx}^{between} \right]^{-1} S_{xx}^{within} = I - F^{between}. \]

This result implies that the slope in the pooled data (total) is a matrix weighted average of the slope within groups and the slope of the means between groups.

The random effects model can also be compared within this framework. The GLS estimator can likewise be written as a matrix weighted average of the within and between estimators, $\hat{\beta} = \hat{F}^{within} b^{within} + (I - \hat{F}^{within}) b^{between}$, with

\[ \hat{F}^{within} = \left[ S_{xx}^{within} + \lambda S_{xx}^{between} \right]^{-1} S_{xx}^{within}, \qquad \lambda = \frac{\sigma_\varepsilon^2}{\sigma_\varepsilon^2 + T\sigma_u^2} = (1 - \theta)^2. \]

If $\lambda = 1$ (i.e., $\sigma_u^2 = 0$), then GLS coincides with OLS and OLS is efficient. However, to the extent that $\lambda$ is less than 1, OLS is inefficient because it gives too much weight to the between-group variation.
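The decomposition and the weighted-average result (52) can be checked numerically. The sketch below is a minimal Python illustration under simplifying assumptions: one regressor ($K = 1$), so all moment matrices are scalars, and the six rows from the format table in the introduction serve as toy data. The variable names are illustrative only.

```python
# Toy panel from the introduction's format table: unit id -> list of (y, x), T = 3 years.
panel = {
    1: [(3.56, 543.0), (3.98, 438.0), (3.20, 648.0)],
    2: [(2.48, 393.0), (2.69, 974.0), (3.01, 753.0)],
}

def mean(values):
    return sum(values) / len(values)

# Overall means of y and x across all units and periods.
y_bar = mean([y for obs in panel.values() for y, _ in obs])
x_bar = mean([x for obs in panel.values() for _, x in obs])

s_xx = {"total": 0.0, "within": 0.0, "between": 0.0}
s_xy = {"total": 0.0, "within": 0.0, "between": 0.0}
for obs in panel.values():
    yi = mean([y for y, _ in obs])   # group mean of y
    xi = mean([x for _, x in obs])   # group mean of x
    for y, x in obs:
        s_xx["total"] += (x - x_bar) ** 2
        s_xy["total"] += (x - x_bar) * (y - y_bar)
        s_xx["within"] += (x - xi) ** 2
        s_xy["within"] += (x - xi) * (y - yi)
    # Between terms carry the factor T = len(obs), as in (43) and (44).
    s_xx["between"] += len(obs) * (xi - x_bar) ** 2
    s_xy["between"] += len(obs) * (xi - x_bar) * (yi - y_bar)

b = {k: s_xy[k] / s_xx[k] for k in s_xx}      # scalar versions of (47)-(49)
f_within = s_xx["within"] / s_xx["total"]     # weight F^within in (52)
weighted = f_within * b["within"] + (1 - f_within) * b["between"]
print(b, f_within, weighted)
```

In this scalar case `f_within` is simply the share of the total variation in $x$ that occurs within units, and `weighted` should reproduce the pooled slope `b["total"]` up to rounding error.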