Topic 10: Panel Data Analysis


Advanced Econometrics (I)
Dong Chen
School of Economics, Peking University

1 Introduction

Panel data combine the features of cross section data and time series. A panel data set usually has the following format.

id    time    y       x
1     1998    3.56    543
1     1999    3.98    438
1     2000    3.20    648
2     1998    2.48    393
2     1999    2.69    974
2     2000    3.01    753
...

The rich information contained in a panel data set allows us to answer some questions that cannot be answered well using a cross section or a time series data set alone. For example, in the study of FDI's impact on domestic firms' productivity using industry-level cross section data, one faces the problem of separating out the potential effect of self-selection. Suppose one estimates a cross section model that regresses domestic firms' average productivity in an industry on the level of FDI presence, along with a set of industry-specific controls, using industry-level data:

$$ \mathit{PROD}_i = \beta_1 + \beta_2 \mathit{FDI}_i + \beta_3 \mathit{CONC}_i + \beta_4 \mathit{KLRATIO}_i + \beta_5 \mathit{SOE}_i + \varepsilon_i, \tag{1} $$

where $\mathit{PROD}$ is industry average productivity, $\mathit{FDI}$ is the level of FDI presence (measured by capital or employment), $\mathit{CONC}$ is a measure of market concentration, $\mathit{KLRATIO}$ is the capital-labor ratio, and $\mathit{SOE}$ is a measure of state-owned enterprises' weight in an industry. A positive coefficient estimate for FDI presence in this case can bear two interpretations. One is that foreign firms' presence raises domestic firms' productivity, which suggests a positive spillover effect of FDI. However, it may also reflect the fact that multinational firms are more likely to enter industries in which domestic firms have higher productivity in the first place. Without information on the time dimension, it is hard to distinguish between these two effects. Panel data, however, allow one to make inferences in cases like this.

An important advantage of panel data is that they allow us to control for unobserved cross-unit heterogeneity that does not vary over time. For example, demographics and civic culture are thought to change very slowly, especially compared to short-term fluctuations in the economy or changes in the political climate. Consider the problem of estimating the effect of imprisonment on crime. The simple correlation between the per capita prison population and property crimes per capita is positive. Controlling for demographic and short-run economic factors, the relationship between imprisonment and property crime remains positive. Does this mean imprisonment has no effect on curbing crime, or is it just because we have omitted some unmeasured factors, such as culture? Panel data are helpful in answering questions like this.

In this chapter, we discuss some widely used models of panel data. Before doing so, we first examine a different but related type of data, namely, independently pooled cross sections.

2 Independently Pooled Cross Sections

Independently pooled cross section data are composed of observations that are sampled randomly from a large population at different points in time. They differ from panel data in that they contain different samples of units in each period instead of following the same units across time. As a result, independently pooled cross section data sets are sometimes also called pseudo-panels.

Models using independently pooled cross section data can be estimated by OLS. Due to the increased sample size, one can obtain more precise estimates of parameters and increased power in hypothesis testing. More importantly, adding a time dimension to the data set can sometimes allow for correct inferences that would otherwise be impossible.

Example 1: Immigration is an important part of Canada's public policy. The Canadian government is concerned about how immigrants are assimilated into Canadian society. One important measure of assimilation is to compare immigrants' employment income with that of native-born Canadians.
A possible way of measuring the assimilation process is to regress people's job earnings on an immigrant dummy variable, a variable indicating the number of years since an immigrant landed in Canada, and a set of individual-specific characteristics such as level of education, age, and work experience:

$$ \mathit{EARN}_i = \beta_1 + \beta_2 \mathit{IMM}_i + \beta_3 \mathit{YEARIMM}_i + \beta_4 \mathit{YEARIMM}_i^2 + \beta_5 \mathit{AGE}_i + \beta_6 \mathit{SCHOOL}_i + \varepsilon_i, \tag{2} $$

where $\mathit{EARN}$ is an individual's job earnings, $\mathit{IMM}$ is a dummy variable indicating whether an individual is an immigrant, $\mathit{YEARIMM}$ is the number of years since immigration, $\mathit{AGE}$ measures an individual's age, and $\mathit{SCHOOL}$ is years of schooling. Suppose this model is estimated using cross section data, for example, a sample from the census of a certain year. To make correct inferences, an important assumption is that the unobserved qualities of immigrants who landed in Canada in different years are constant. However, this assumption may well be violated due to the dramatic shift in immigrants' source countries during the past several decades. Therefore, the observed positive correlation between immigrants' earnings and the length of time they have spent in Canada may not reflect real assimilation, but may instead be due, at least partly, to changes in the unobserved qualities of different immigrant cohorts. If, however, the estimation is based on pooled data from multiple censuses, then this cohort effect can be controlled for.

Several issues need to be considered when using pooled cross section data. First, time period dummy variables are usually included in the model to allow the intercept to differ across periods. If the effect of an explanatory variable changes over time, then we should interact the time period dummies with that explanatory variable. Note that if we interact all explanatory variables with the time period dummies, then this is equivalent to running separate regressions for each time period. This calls for testing for structural changes in the model across time, for example, using the Chow test discussed earlier.

3 Panel Data Models

The basic framework of panel data analysis is summarized by the following model:

$$ y_{it} = x_{it}'\beta + z_i'\alpha + \varepsilon_{it}, \quad i = 1, \ldots, n; \; t = 1, \ldots, T. \tag{3} $$

There are $K$ regressors in $x_{it}$, not including a constant term. The term $z_i'\alpha$ measures the individual or heterogeneous effect, where $z_i$ contains a constant term and a set of individual or group specific variables that are constant over time $t$. Note that if $z_i$ is observed for all individuals, then this model is simply a regular linear regression model that can be estimated by OLS. Depending on the assumptions on $z_i'\alpha$, we obtain different models.

Pooled Regression: If $z_i$ contains only a constant term (i.e., there is no unobserved individual or group specific heterogeneity), then OLS gives consistent and efficient estimates of the common intercept $\alpha$ and the slope vector $\beta$.

Fixed Effects: If $z_i$ is unobserved, at least partially, but correlated with $x_{it}$, then the OLS estimator of $\beta$ is biased and inconsistent due to the omitted variable problem. In this case, the model becomes

$$ y_{it} = x_{it}'\beta + \alpha_i + \varepsilon_{it}, \tag{4} $$

where $\alpha_i$ embodies all the unobservable individual or group specific effects.
Random Effects: If the unobserved individual heterogeneity can be assumed to be uncorrelated with $x_{it}$, then the model can be rewritten as

$$ y_{it} = x_{it}'\beta + E(z_i'\alpha) + \{ z_i'\alpha - E(z_i'\alpha) \} + \varepsilon_{it} = x_{it}'\beta + \alpha + u_i + \varepsilon_{it}, \tag{5} $$

where $u_i$ is a group specific random term.

4 Fixed Effects

4.1 Estimation

Note that in equation (4), the $\alpha_i$ can be treated as individual specific intercept terms, and they can be estimated by including a set of dummy variables in the model. Thus, the fixed effects model can be written in compact form as

$$ y = D\alpha + X\beta + \varepsilon, \tag{6} $$

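where

$$ D = \begin{bmatrix} \mathbf{i} & 0 & \cdots & 0 \\ 0 & \mathbf{i} & \cdots & 0 \\ \vdots & & \ddots & \vdots \\ 0 & 0 & \cdots & \mathbf{i} \end{bmatrix}_{nT \times n}, \qquad \alpha = \begin{bmatrix} \alpha_1 \\ \alpha_2 \\ \vdots \\ \alpha_n \end{bmatrix}_{n \times 1}, $$

and $\mathbf{i}$ is a $T \times 1$ vector of ones. This model is also called the least squares dummy variable (LSDV) model.

Estimating (6) using OLS involves inverting an $(n + K) \times (n + K)$ matrix. If $n$ is large, which is usually the case, this may exceed the storage capacity of the computer. There is an alternative way to proceed that only requires inverting a $K \times K$ matrix. This is achieved by using the results of partitioned regression to first estimate $\beta$ alone from model (6). The normal equations are

$$ \begin{matrix} \text{(i)} \\ \text{(ii)} \end{matrix} \quad \begin{bmatrix} D'D & D'X \\ X'D & X'X \end{bmatrix} \begin{bmatrix} a \\ b \end{bmatrix} = \begin{bmatrix} D'y \\ X'y \end{bmatrix}. \tag{7} $$

We can first solve for $a$ from (i) in (7):

$$ a = (D'D)^{-1} D'y - (D'D)^{-1} D'Xb \tag{8} $$
$$ \phantom{a} = (D'D)^{-1} D'(y - Xb). \tag{9} $$

(Footnote 1: The solution (8) implies an important result: if $D'X = 0$, then $a = (D'D)^{-1} D'y$, which is just the OLS estimator from regressing $y$ on $D$.)

Substituting (9) into (ii) in (7), we have

$$ X'D (D'D)^{-1} D'y - X'D (D'D)^{-1} D'Xb + X'Xb = X'y. $$

Solving for $b$ yields

$$ b = \left[ X' \left( I - D (D'D)^{-1} D' \right) X \right]^{-1} \left[ X' \left( I - D (D'D)^{-1} D' \right) y \right] = [X' M_D X]^{-1} [X' M_D y], \tag{10} $$

where $M_D = I - D (D'D)^{-1} D'$. Since $M_D$ is a symmetric idempotent matrix, we can rewrite (10) as

$$ b = [X' M_D' M_D X]^{-1} [X' M_D' M_D y] = (X_*' X_*)^{-1} X_*' y_*, \tag{11} $$

where $X_* = M_D X$ and $y_* = M_D y$. Note that the columns of the matrix $D$ are orthogonal, so

$$ M_D = \begin{bmatrix} M^0 & 0 & \cdots & 0 \\ 0 & M^0 & \cdots & 0 \\ \vdots & & \ddots & \vdots \\ 0 & 0 & \cdots & M^0 \end{bmatrix}, $$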

where $M^0 = I_T - \frac{1}{T} \mathbf{i}\mathbf{i}'$. Recall that the matrix $M^0$ creates deviations from the mean when postmultiplied by any $T \times 1$ vector $z_i$ (see the lecture notes on the derivation of $R^2$). That is, $M^0 z_i = z_i - \bar{z}_i \mathbf{i}$. Therefore, the least squares regression of $M_D y$ on $M_D X$ is equivalent to a regression of $(y_{it} - \bar{y}_{i.})$ on $(x_{it} - \bar{x}_{i.})$, where $\bar{y}_{i.}$ and $\bar{x}_{i.}$ are the scalar and the $K \times 1$ vector of means of $y_{it}$ and $x_{it}$ over the $T$ observations for group $i$. That is,

$$ y_{it} - \bar{y}_{i.} = (x_{it} - \bar{x}_{i.})'\beta + (\varepsilon_{it} - \bar{\varepsilon}_{i.}). \tag{12} $$

Hence, the fixed effects estimator $b$ can also be written as

$$ b = \left[ \sum_{i=1}^{n} \sum_{t=1}^{T} (x_{it} - \bar{x}_{i.}) (x_{it} - \bar{x}_{i.})' \right]^{-1} \left[ \sum_{i=1}^{n} \sum_{t=1}^{T} (x_{it} - \bar{x}_{i.}) (y_{it} - \bar{y}_{i.}) \right]. $$

With the estimate for $\beta$, the estimate for $\alpha$ can be obtained from the other normal equation in the partitioned regression:

$$ a = (D'D)^{-1} D'(y - Xb), \tag{13} $$

or

$$ a_i = \bar{y}_{i.} - \bar{x}_{i.}'b. \tag{14} $$

Remark 1: In fixed effects models, explanatory variables that are constant over time cannot be included because in this case $x_{it} - \bar{x}_{i.} = 0$ for all $i$. Also, when a full set of time period dummy variables (or a linear time trend) is included, explanatory variables whose change across time is constant, e.g. age, cannot be included. Although time-invariant variables cannot be included by themselves in the fixed effects model, one can interact them with variables that change over time, for example, with time period dummy variables. Doing so yields estimates of how the partial effect of that variable changes over time.

Remark 2: A panel data set with missing values for some time periods is called an unbalanced panel. Generally, we can proceed as usual using the available data. Note that in this case units with only one period of data play no role in the estimation and are dropped, because in these cases $y_{it} - \bar{y}_{i.} = 0$ and $x_{it} - \bar{x}_{i.} = 0$.

4.2 Properties of the Fixed Effects Estimator

If we assume random sampling on the cross section dimension and strict exogeneity on the time series dimension (conditional on the unobserved effects),

$$ E(\varepsilon_{it} \mid X_i, \alpha_i) = 0, \tag{15} $$

then the fixed effects estimators of $\alpha$ and $\beta$ are unbiased. To see the unbiasedness of $b$, note that $M_D D = 0$, so

$$ E(b) = E\left\{ [X' M_D X]^{-1} [X' M_D y] \right\} = E\left\{ [X' M_D X]^{-1} [X' M_D (D\alpha + X\beta + \varepsilon)] \right\} = E\left\{ 0 + \beta + [X' M_D X]^{-1} X' M_D \varepsilon \right\} = \beta. $$

The estimator of the covariance matrix for $b$ is

$$ \text{Est.Var}(b) = s^2 (X' M_D X)^{-1} \tag{16} $$
$$ \phantom{\text{Est.Var}(b)} = s^2 \left[ \sum_{i=1}^{n} \sum_{t=1}^{T} (x_{it} - \bar{x}_{i.}) (x_{it} - \bar{x}_{i.})' \right]^{-1}, \tag{17} $$

where

$$ s^2 = \frac{\sum_{i=1}^{n} \sum_{t=1}^{T} e_{it}^2}{nT - n - K}. \tag{18} $$

The $it$th residual, $e_{it}$, is defined as

$$ e_{it} = y_{it} - x_{it}'b - a_i = y_{it} - x_{it}'b - \bar{y}_{i.} + \bar{x}_{i.}'b = (y_{it} - \bar{y}_{i.}) - (x_{it} - \bar{x}_{i.})'b. \tag{19} $$

For the fixed effects estimator to be BLUE, we need to further assume homoskedasticity and no autocorrelation. That is, for each $t$, and for all $t \neq s$,

$$ \text{Var}(\varepsilon_{it} \mid X_i, \alpha_i) = \text{Var}(\varepsilon_{it}) = \sigma_\varepsilon^2, \tag{20} $$
$$ \text{Cov}(\varepsilon_{it}, \varepsilon_{is} \mid X_i, \alpha_i) = 0. \tag{21} $$

The fixed effects estimator of $\beta$ is consistent when either $n$ or $T$ or both tend to infinity. However, the estimator of $\alpha$ is consistent only if $T \to \infty$.

STATA Tips

To estimate fixed effects models in STATA, we first need to declare the data set as panel data by using the tsset command.

    tsset id_var date_var, option

The usual time series operators (lag, difference, etc.) can then be applied to the panel data. For example, to create the first difference of the variable profit across years for each firm, you can type

    tsset firm year, yearly
    gen dprofit = d.profit

A fixed effects model can then be estimated by using the xtreg command with the fe option.

    xtreg dep_var var_list, fe

Note that STATA can only process data arranged in long form rather than wide form. To transform a wide-form data set to long form, use the reshape command. Check the help file for more details.
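The within transformation and formulas (14) and (16) to (19) can be illustrated with a small sketch. It is not part of the original notes: it assumes a single regressor ($K = 1$), a balanced toy panel with made-up numbers, and hypothetical helper names (`group_means`, `within_estimator`, `pooled_slope`). The unit effects are constructed to be negatively correlated with the regressor, so pooled OLS should be far from the true slope of 2 while the within estimator should be close to it.

```python
# Sketch: within (fixed effects) estimator for ONE regressor (K = 1).
# The toy panel and all function names are hypothetical illustrations.

def group_means(values, ids):
    """Return {unit id: mean of that unit's values}."""
    sums, counts = {}, {}
    for v, g in zip(values, ids):
        sums[g] = sums.get(g, 0.0) + v
        counts[g] = counts.get(g, 0) + 1
    return {g: sums[g] / counts[g] for g in sums}

def within_estimator(y, x, ids):
    """Return slope b, intercepts a_i, s^2 and Est.Var(b)."""
    ybar, xbar = group_means(y, ids), group_means(x, ids)
    xd = [xi - xbar[g] for xi, g in zip(x, ids)]      # x_it - xbar_i.
    yd = [yi - ybar[g] for yi, g in zip(y, ids)]      # y_it - ybar_i.
    sxx = sum(u * u for u in xd)
    b = sum(u * v for u, v in zip(xd, yd)) / sxx
    a = {g: ybar[g] - xbar[g] * b for g in ybar}      # equation (14)
    resid = [v - u * b for u, v in zip(xd, yd)]       # equation (19)
    n, K = len(ybar), 1
    s2 = sum(e * e for e in resid) / (len(y) - n - K) # equation (18)
    return b, a, s2, s2 / sxx                         # last item: (17)

def pooled_slope(y, x):
    """Simple OLS slope that ignores the panel structure."""
    xm, ym = sum(x) / len(x), sum(y) / len(y)
    sxy = sum((u - xm) * (v - ym) for u, v in zip(x, y))
    return sxy / sum((u - xm) ** 2 for u in x)

# Toy panel: n = 3 units, T = 3 periods, true slope 2.
ids = [1, 1, 1, 2, 2, 2, 3, 3, 3]
x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0]
alpha = {1: 10.0, 2: 5.0, 3: 0.0}   # unit effects, correlated with x
noise = [0.1, -0.2, 0.1, -0.1, 0.2, -0.1, 0.2, -0.1, -0.1]
y = [alpha[g] + 2.0 * xi + e for g, xi, e in zip(ids, x, noise)]

b, a, s2, var_b = within_estimator(y, x, ids)
b_pooled = pooled_slope(y, x)
print("within slope:", b, "pooled slope:", b_pooled)
print("estimated intercepts:", a)
```

In this constructed example the pooled slope is pulled well below 2 because the omitted unit effects fall as x rises, which is the omitted variable problem that motivates the fixed effects model.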

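5 Random Effects

5.1 Assumptions

If we assume $\alpha_i$ is random and uncorrelated with $X$, then we can make more efficient use of the data by using the random effects model. Consider a reformulation of the model

$$ y_{it} = x_{it}'\beta + (\alpha + u_i) + \varepsilon_{it}. \tag{22} $$

In this case we have $K$ regressors and a constant term $\alpha$, which is the mean of the unobserved heterogeneity, $E(z_i'\alpha)$. The component $u_i$ is the random heterogeneity specific to the $i$th observation and is constant over time. It is further assumed that

$$ E(\varepsilon_{it} \mid X) = E(u_i \mid X) = 0, $$
$$ E(\varepsilon_{it}^2 \mid X) = \sigma_\varepsilon^2, $$
$$ E(u_i^2 \mid X) = \sigma_u^2, $$
$$ E(\varepsilon_{it} u_j \mid X) = 0 \text{ for all } i, t, j, $$
$$ E(\varepsilon_{it} \varepsilon_{js} \mid X) = 0 \text{ if } t \neq s \text{ or } i \neq j, $$
$$ E(u_i u_j \mid X) = 0 \text{ if } i \neq j. $$

Let $\eta_{it} = u_i + \varepsilon_{it}$, which is the composite error term. It follows that

$$ E(\eta_{it} \mid X) = 0, $$
$$ E(\eta_{it}^2 \mid X) = \sigma_u^2 + \sigma_\varepsilon^2, $$
$$ E(\eta_{it} \eta_{is} \mid X) = \sigma_u^2 \text{ if } t \neq s, $$
$$ E(\eta_{it} \eta_{js} \mid X) = 0 \text{ for all } t \text{ and } s \text{ if } i \neq j. $$

Denote $\eta_i = [\eta_{i1}, \eta_{i2}, \ldots, \eta_{iT}]'$. Let $E(\eta_i \eta_i' \mid X) = \Sigma$. Then

$$ \Sigma = \begin{bmatrix} \sigma_u^2 + \sigma_\varepsilon^2 & \sigma_u^2 & \cdots & \sigma_u^2 \\ \sigma_u^2 & \sigma_u^2 + \sigma_\varepsilon^2 & \cdots & \sigma_u^2 \\ \vdots & & \ddots & \vdots \\ \sigma_u^2 & \sigma_u^2 & \cdots & \sigma_u^2 + \sigma_\varepsilon^2 \end{bmatrix} \tag{23} $$
$$ \phantom{\Sigma} = \sigma_\varepsilon^2 I_T + \sigma_u^2 \mathbf{i}_T \mathbf{i}_T', \tag{24} $$

where $\mathbf{i}_T$ is a $T \times 1$ vector of 1s. Since observations $i$ and $j$ are independent, the disturbance covariance matrix for the full $nT$ observations is

$$ \Omega = \begin{bmatrix} \Sigma & 0 & \cdots & 0 \\ 0 & \Sigma & \cdots & 0 \\ \vdots & & \ddots & \vdots \\ 0 & 0 & \cdots & \Sigma \end{bmatrix} = I_n \otimes \Sigma. \tag{25} $$

5.2 GLS Estimator

Given the error structure of the random effects model, OLS applied to model (22) yields a consistent estimator of $\beta$, but it is not efficient. To obtain an efficient estimator, we use the GLS method. The GLS estimator of the slope parameters is

$$ \hat{\beta} = \left( X' \Omega^{-1} X \right)^{-1} X' \Omega^{-1} y = \left( \sum_{i=1}^{n} X_i' \Sigma^{-1} X_i \right)^{-1} \left( \sum_{i=1}^{n} X_i' \Sigma^{-1} y_i \right). \tag{26} $$

The GLS method is equivalent to OLS on the transformed model

$$ y_{it} - \theta \bar{y}_{i.} = (1 - \theta)\alpha + (x_{it} - \theta \bar{x}_{i.})'\beta + (\eta_{it} - \theta \bar{\eta}_{i.}), \tag{27} $$

where

$$ \theta = 1 - \frac{\sigma_\varepsilon}{\sqrt{\sigma_\varepsilon^2 + T \sigma_u^2}}. \tag{28} $$

The transformed variables are known as quasi-demeaned data because they are formed by subtracting only a fraction of the group averages. Note the similarity of this procedure to the computation in the fixed effects model, which uses $\theta = 1$. Unlike in fixed effects models, it is possible to include explanatory variables that are constant over time in the random effects model.

5.3 Feasible GLS

Since the variance components, $\sigma_u^2$ and $\sigma_\varepsilon^2$, are usually unknown, we can use a two-step method to obtain the FGLS estimator. In the first step, we estimate the variance components using consistent estimators, and in the second step, we substitute those values into the GLS estimator. Specifically, consider

$$ y_{it} = x_{it}'\beta + \alpha + \varepsilon_{it} + u_i, \tag{29} $$
$$ \bar{y}_{i.} = \bar{x}_{i.}'\beta + \alpha + \bar{\varepsilon}_{i.} + u_i. \tag{30} $$

Taking the difference yields

$$ y_{it} - \bar{y}_{i.} = (x_{it} - \bar{x}_{i.})'\beta + (\varepsilon_{it} - \bar{\varepsilon}_{i.}). \tag{31} $$

The OLS estimator of $\beta$ from model (31) is just the LSDV estimator, which is unbiased and consistent. For each group $i$, we can estimate $\sigma_\varepsilon^2$ by

$$ s_e^2(i) = \frac{\sum_{t=1}^{T} (e_{it} - \bar{e}_{i.})^2}{T - K - 1}, \tag{32} $$

where $e_{it}$ is given in (19). There are $n$ such estimators, so we can take the average and obtain

$$ \bar{s}_e^2 = \frac{1}{n} \sum_{i=1}^{n} s_e^2(i) = \frac{\sum_{i=1}^{n} \sum_{t=1}^{T} (e_{it} - \bar{e}_{i.})^2}{nT - nK - n}. $$

However, since $\alpha$ and $\beta$ are not estimated $n$ times, the above expression overcorrects for the degrees of freedom. It can be shown that an unbiased estimator for $\sigma_\varepsilon^2$ is

$$ s_{LSDV}^2 = \frac{\sum_{i=1}^{n} \sum_{t=1}^{T} (e_{it} - \bar{e}_{i.})^2}{nT - n - K}. \tag{33} $$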
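The quasi-demeaning transformation in (27) and (28) can be sketched in a few lines. This is an illustration rather than the notes' own procedure: it assumes one regressor, a balanced toy panel with made-up numbers, and hypothetical helper names. It shows the two limiting cases mentioned in the text: $\theta = 0$ reproduces pooled OLS and $\theta = 1$ reproduces the within (fixed effects) slope.

```python
import math

# Sketch: OLS on quasi-demeaned data, equation (27), for K = 1.
# Balanced toy panel with hypothetical numbers; names are illustrative.

def theta(sigma_e2, sigma_u2, T):
    """Equation (28): fraction of the group mean that is subtracted."""
    return 1.0 - math.sqrt(sigma_e2) / math.sqrt(sigma_e2 + T * sigma_u2)

def quasi_demean(values, ids, th):
    """Return v_it - th * vbar_i. for every observation."""
    sums, counts = {}, {}
    for v, g in zip(values, ids):
        sums[g] = sums.get(g, 0.0) + v
        counts[g] = counts.get(g, 0) + 1
    return [v - th * sums[g] / counts[g] for v, g in zip(values, ids)]

def ols_slope(y, x):
    """Slope from a simple regression of y on x with an intercept."""
    xm, ym = sum(x) / len(x), sum(y) / len(y)
    sxy = sum((u - xm) * (v - ym) for u, v in zip(x, y))
    return sxy / sum((u - xm) ** 2 for u in x)

def re_slope(y, x, ids, th):
    """GLS slope for a given theta via the transformed regression."""
    return ols_slope(quasi_demean(y, ids, th), quasi_demean(x, ids, th))

ids = [1, 1, 1, 2, 2, 2, 3, 3, 3]
x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0]
y = [12.1, 13.8, 16.1, 12.9, 15.2, 16.9, 14.2, 15.9, 17.9]

th_half = theta(1.0, 1.0, 3)           # sigma_e^2 = sigma_u^2 = 1, T = 3
slope_pooled = re_slope(y, x, ids, 0.0)
slope_within = re_slope(y, x, ids, 1.0)
slope_re = re_slope(y, x, ids, th_half)
print(th_half, slope_pooled, slope_re, slope_within)
```

The intermediate value of $\theta$ gives a slope between the pooled and within estimates in this example, which matches the interpretation of random effects as a partial demeaning.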

Note that estimating (29) by pooled OLS gives consistent estimators of $\alpha$ and $\beta$. Hence, a consistent estimator of $E(\eta_{it}^2)$ is

$$ s_{Pooled}^2 = \frac{e'e}{nT - K - 1}, \tag{34} $$

where $e$ here denotes the vector of pooled OLS residuals. That is,

$$ \text{plim}\, s_{Pooled}^2 = \sigma_u^2 + \sigma_\varepsilon^2. $$

Therefore, we can estimate $\sigma_u^2$ by

$$ \hat{\sigma}_u^2 = s_{Pooled}^2 - s_{LSDV}^2. \tag{35} $$

Plugging $\hat{\sigma}_\varepsilon^2$ and $\hat{\sigma}_u^2$ into (28), we obtain an estimator of $\theta$. When the sample size is large (in the sense that either $n$ or $T$ or both tend to infinity), the FGLS estimator is asymptotically as efficient as the true GLS estimator. Even for moderate sample sizes, FGLS is still more efficient than the fixed effects estimator.

6 Testing for Random Effects

If the regressors are correlated with the random effects $\alpha_i$, then the GLS estimator of $\beta$ is inconsistent. If that is the case, then we should use the fixed effects model, which yields consistent estimators whether or not the effects are correlated with the regressors. Otherwise, random effects models are more efficient. It is possible to test for such orthogonality by using Hausman's specification test:

$$ H_0: \text{Cov}(x_{itj}, \alpha_i) = 0 \quad \text{vs.} \quad H_1: \text{Cov}(x_{itj}, \alpha_i) \neq 0. $$

The test statistic is

$$ W = (b - \hat{\beta})' \Psi^{-1} (b - \hat{\beta}) \overset{a}{\sim} \chi_K^2, \tag{36} $$

where $b$ is the fixed effects estimator, $\hat{\beta}$ is the random effects estimator, and $\Psi = \text{Est.Var}(b) - \text{Est.Var}(\hat{\beta})$.

STATA Tips

To estimate the random effects model in STATA, use the xtreg command with the re option.

    xtreg dep_var var_list, re

The Hausman test can be carried out using the following commands.

    quiet xtreg dep_var var_list, fe
    estimates store fixed
    quiet xtreg dep_var var_list, re
    hausman fixed
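For intuition, the Hausman statistic in (36) reduces to a simple ratio when there is a single regressor. The sketch below is an assumption-laden illustration, not STATA's implementation: it takes $K = 1$, uses made-up estimates and variances, and the function name `hausman_scalar` is hypothetical. With one degree of freedom the chi-squared tail probability can be computed from the complementary error function in the standard library.

```python
import math

# Sketch: Hausman statistic (36) for a single coefficient (K = 1).
# All input numbers are hypothetical, chosen only for illustration.

def hausman_scalar(b_fe, b_re, var_fe, var_re):
    """Return (W, p-value) for H0: effects uncorrelated with x."""
    psi = var_fe - var_re   # Psi = Est.Var(b) - Est.Var(beta_hat)
    if psi <= 0:
        # In finite samples the estimated difference can be non-positive,
        # in which case the statistic is not well defined.
        raise ValueError("Est.Var(FE) must exceed Est.Var(RE)")
    w = (b_fe - b_re) ** 2 / psi
    # Upper tail of chi-squared with 1 degree of freedom.
    p_value = math.erfc(math.sqrt(w / 2.0))
    return w, p_value

w, p = hausman_scalar(b_fe=1.0, b_re=0.8, var_fe=0.02, var_re=0.01)
print("W =", w, "p-value =", p)
```

A small p-value is evidence against the orthogonality assumption and therefore favors the fixed effects specification.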

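7 Comparison of OLS, Fixed Effects, and Random Effects

We can formulate the fixed effects panel data regression model in three ways. First, the original model is

$$ y_{it} = x_{it}'\beta + \alpha + \varepsilon_{it}. \tag{37} $$

In terms of deviations from the group means,

$$ y_{it} - \bar{y}_{i.} = (x_{it} - \bar{x}_{i.})'\beta + (\varepsilon_{it} - \bar{\varepsilon}_{i.}), \tag{38} $$

and in terms of the group means,

$$ \bar{y}_{i.} = \bar{x}_{i.}'\beta + \alpha + \bar{\varepsilon}_{i.}. \tag{39} $$

In (37), the matrices of the sums of squares and cross products around the overall means $\bar{\bar{x}}$ and $\bar{\bar{y}}$ are

$$ S_{xx}^{total} = \sum_{i=1}^{n} \sum_{t=1}^{T} (x_{it} - \bar{\bar{x}}) (x_{it} - \bar{\bar{x}})', \tag{40} $$
$$ S_{xy}^{total} = \sum_{i=1}^{n} \sum_{t=1}^{T} (x_{it} - \bar{\bar{x}}) (y_{it} - \bar{\bar{y}}). $$

In (38), the matrices around the group means are given by

$$ S_{xx}^{within} = \sum_{i=1}^{n} \sum_{t=1}^{T} (x_{it} - \bar{x}_{i.}) (x_{it} - \bar{x}_{i.})', \tag{41} $$
$$ S_{xy}^{within} = \sum_{i=1}^{n} \sum_{t=1}^{T} (x_{it} - \bar{x}_{i.}) (y_{it} - \bar{y}_{i.}). \tag{42} $$

These collect the variation within groups. Finally, for (39), the moment matrices are the between-groups sums of squares and cross products:

$$ S_{xx}^{between} = \sum_{i=1}^{n} T (\bar{x}_{i.} - \bar{\bar{x}}) (\bar{x}_{i.} - \bar{\bar{x}})', \tag{43} $$
$$ S_{xy}^{between} = \sum_{i=1}^{n} T (\bar{x}_{i.} - \bar{\bar{x}}) (\bar{y}_{i.} - \bar{\bar{y}}). \tag{44} $$

It can be shown that

$$ S_{xx}^{total} = S_{xx}^{within} + S_{xx}^{between}, \tag{45} $$
$$ S_{xy}^{total} = S_{xy}^{within} + S_{xy}^{between}. \tag{46} $$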
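The decomposition of the total sums of squares can be verified in one step. The following worked derivation is a sketch for the $S_{xx}$ case, writing $\bar{\bar{x}}$ for the overall mean (a notational assumption); the $S_{xy}$ case is analogous.

```latex
\begin{align*}
x_{it} - \bar{\bar{x}}
  &= (x_{it} - \bar{x}_{i.}) + (\bar{x}_{i.} - \bar{\bar{x}}), \\
\sum_{i=1}^{n}\sum_{t=1}^{T}
  (x_{it} - \bar{\bar{x}})(x_{it} - \bar{\bar{x}})'
  &= \sum_{i=1}^{n}\sum_{t=1}^{T}
     (x_{it} - \bar{x}_{i.})(x_{it} - \bar{x}_{i.})'
   + \sum_{i=1}^{n} T\,
     (\bar{x}_{i.} - \bar{\bar{x}})(\bar{x}_{i.} - \bar{\bar{x}})' \\
  &\quad + \sum_{i=1}^{n}
     \Big[ \sum_{t=1}^{T} (x_{it} - \bar{x}_{i.}) \Big]
     (\bar{x}_{i.} - \bar{\bar{x}})'
   + \sum_{i=1}^{n} (\bar{x}_{i.} - \bar{\bar{x}})
     \Big[ \sum_{t=1}^{T} (x_{it} - \bar{x}_{i.}) \Big]'.
\end{align*}
% The two cross terms vanish because, for every group i,
% \sum_{t=1}^{T} (x_{it} - \bar{x}_{i.}) = 0,
% which leaves the within and between matrices.
```

The key fact is that deviations from a group mean sum to zero within each group, so the within and between components are orthogonal.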

As a result, the OLS estimator with pooled data (without exploiting the panel feature of the data set) is

$$ b^{total} = \left( S_{xx}^{total} \right)^{-1} S_{xy}^{total} = \left( S_{xx}^{within} + S_{xx}^{between} \right)^{-1} \left( S_{xy}^{within} + S_{xy}^{between} \right). \tag{47} $$

The within-group estimator, which is also the LSDV or fixed effects estimator, is given by

$$ b^{within} = \left( S_{xx}^{within} \right)^{-1} S_{xy}^{within}. \tag{48} $$

And finally, the between-groups estimator is

$$ b^{between} = \left( S_{xx}^{between} \right)^{-1} S_{xy}^{between}. \tag{49} $$

From (48) and (49), we have

$$ S_{xy}^{within} = S_{xx}^{within} b^{within}, \tag{50} $$
$$ S_{xy}^{between} = S_{xx}^{between} b^{between}. \tag{51} $$

Substituting (50) and (51) into (47), we have

$$ b^{total} = F^{within} b^{within} + F^{between} b^{between}, \tag{52} $$

where

$$ F^{within} = \left( S_{xx}^{within} + S_{xx}^{between} \right)^{-1} S_{xx}^{within} = I - F^{between}. $$

This result implies that the slope in the pooled data (total) is a weighted average of the average slope within groups and the slope of the means between groups.

The random effects model can also be compared within this framework. The GLS estimator can be written as the same kind of weighted average, $\hat{\beta} = \hat{F}^{within} b^{within} + (I - \hat{F}^{within}) b^{between}$, with

$$ \hat{F}^{within} = \left( S_{xx}^{within} + \lambda S_{xx}^{between} \right)^{-1} S_{xx}^{within}, $$

where

$$ \lambda = \frac{\sigma_\varepsilon^2}{\sigma_\varepsilon^2 + T \sigma_u^2} = (1 - \theta)^2. $$

If $\lambda = 1$ (i.e., $\sigma_u^2 = 0$), then OLS is efficient. However, to the extent that $\lambda$ is less than 1, OLS is inefficient because it gives too much weight to the between-group variation.
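The weighted-average result in (52) is easy to check numerically when there is a single regressor, because all the moment matrices are then scalars. The sketch below is only an illustration under those assumptions: the balanced toy panel is made up, and the helper names (`moments`, `weighted_slope`) are hypothetical. Setting $\lambda = 1$ in the weight reproduces pooled OLS, and $\lambda = 0$ reproduces the within estimator.

```python
# Sketch: total = within + between decomposition for K = 1 (scalars).
# The data and helper names are hypothetical illustrations.

def moments(y, x, ids):
    """Return (S_xx, S_xy) pairs for total, within and between."""
    n_obs = len(y)
    x_all, y_all = sum(x) / n_obs, sum(y) / n_obs     # overall means
    sx, sy, T = {}, {}, {}
    for xi, yi, g in zip(x, y, ids):
        sx[g] = sx.get(g, 0.0) + xi
        sy[g] = sy.get(g, 0.0) + yi
        T[g] = T.get(g, 0) + 1
    gx = {g: sx[g] / T[g] for g in sx}                # group means of x
    gy = {g: sy[g] / T[g] for g in sy}                # group means of y
    total = (sum((xi - x_all) ** 2 for xi in x),
             sum((xi - x_all) * (yi - y_all) for xi, yi in zip(x, y)))
    within = (sum((xi - gx[g]) ** 2 for xi, g in zip(x, ids)),
              sum((xi - gx[g]) * (yi - gy[g])
                  for xi, yi, g in zip(x, y, ids)))
    between = (sum(T[g] * (gx[g] - x_all) ** 2 for g in gx),
               sum(T[g] * (gx[g] - x_all) * (gy[g] - y_all) for g in gx))
    return total, within, between

ids = [1, 1, 1, 2, 2, 2, 3, 3, 3]
x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0]
y = [12.1, 13.8, 16.1, 12.9, 15.2, 16.9, 14.2, 15.9, 17.9]

(st_xx, st_xy), (sw_xx, sw_xy), (sb_xx, sb_xy) = moments(y, x, ids)
b_total = st_xy / st_xx          # equation (47)
b_within = sw_xy / sw_xx         # equation (48)
b_between = sb_xy / sb_xx        # equation (49)

def weighted_slope(lam):
    """Weighted average of within and between slopes with weight lam."""
    f_within = sw_xx / (sw_xx + lam * sb_xx)
    return f_within * b_within + (1.0 - f_within) * b_between

print(b_total, weighted_slope(1.0))   # both are the pooled OLS slope
print(b_within, weighted_slope(0.0))  # both are the within slope
```

Values of $\lambda$ strictly between 0 and 1 shift weight away from the between-group variation relative to pooled OLS, which is the sense in which the random effects estimator sits between the two.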