Panel Data Econometrics


RDB, September 2012

Contents

1 Advantages of Panel (Longitudinal) Data

Part I Preliminaries

2 Some Common Estimators
    Ordinary Least Squares (OLS)
        Small Sample Properties
        Large Sample Properties
    Instrumental Variables (IV)
    Generalized Least Squares (GLS)
    (Generalized) Method of Moments (GMM)
    Maximum Likelihood (ML)
        Estimator
        Regularity conditions
        Properties
    Maximum Entropy (ME)
        Entropy
        Maximum entropy principle
        Use

3 Some Common Tests
    Likelihood ratio test
    Lagrange multiplier (LM) test
    Wald test
    Hausman Test
    Sargan-Hansen J-Test

4 Models, parameters of interest and the incidental parameters problem
    Review of some nonlinear models
        Discrete choice
        Censoring
        Selection
        Count data
    Policy parameters
    The incidental parameter problem
    Terminology

Part II Linear Models

5 Static
    Uncorrelated individual effects
        OLS
        GLS
        FGLS
        Between Groups Estimator
    Correlated individual effects
        Within estimator or Least Squares Dummy Variables (LSDV) estimation
        First differenced OLS
        Hausman-Taylor IV
        Chamberlain device
    Hausman test

6 Dynamic
    No exogenous regressors
        Previous Estimators
        IV
        GMM
    Exogenous regressors
        Moment conditions for the equation in differences
        Moment conditions for the equation in levels
    Serially correlated errors
        MA errors
        AR errors
        Testing for serial correlation

7 Random Coefficient Models
    General
    Correlated Random Coefficient Models
    Random Coefficient Dynamic Models*

Part III Non-Linear Models

8 Static models
    Random effects
    Butler-Moffit
    Simulation based estimation*
    Correlated RE
    Uncorrelated RE Probit
    Correlated RE Probit
    Fixed effects
    Conditional MLE
    Maximum score*
    Censored regression
    Fixed T identifiability
    Bias reduction
    Orthogonal Parameters*

9 Dynamic models
    The initial conditions problem
    Dynamic discrete choice and duration
    Random Effects Tobit
    Fixed effects

Part IV Additional Issues

Cross-sectional dependence*

Variance estimation
    One-way clustering
    More-way clustering
    QML

Attrition - Sample selection
    Simple panel data model of non-response
    Identification
    Two-step estimation - RE
    ML estimation of a RE model
    Estimation of a FE model

1 Advantages of Panel (Longitudinal) Data

Panel data:

1. overcome the shortcomings of both cross-sectional and time-series data:
   a) Cross-sectional data do not provide dynamic information.
   b) Time-series variables tend to move collinearly. It is difficult to separate micro- from macro-dynamic effects. Estimation of distributed lag models relies on strong assumptions without firm empirical justification.
2. allow for more complicated behavioral hypotheses;
3. allow one to control for the effects of missing or unobserved variables;
4. provide micro foundations for aggregate data analysis;
5. can simplify inference if observations are IID across cross-sectional units.

Part I Preliminaries

2 Some Common Estimators

2.1 Ordinary Least Squares (OLS)

Consider the general linear model

$$Y = X\beta + V, \tag{1}$$

where

$$Y = \begin{pmatrix} y_1 \\ y_2 \\ \vdots \\ y_N \end{pmatrix}; \quad X = \begin{pmatrix} x_1' \\ x_2' \\ \vdots \\ x_N' \end{pmatrix}; \quad V = \begin{pmatrix} v_1 \\ v_2 \\ \vdots \\ v_N \end{pmatrix}.$$

The OLS estimators of the unknown parameters $\beta$ and $\sigma^2 = \operatorname{Var}[v]$ are defined as

$$\hat\beta = \arg\min_{\beta}\, (Y - X\beta)'(Y - X\beta),$$

which can (in the case of a linear model) be rewritten as

$$\hat\beta = (X'X)^{-1}X'Y,$$

and

$$s^2 = (N-K)^{-1}\hat V'\hat V.$$

2.1.1 Small Sample Properties

Consider now the following assumptions about (1).

Assumption LS1 (FR): The $N \times K$ matrix $X$ is of rank $K$.

Assumption LS2 (MI): The disturbances $V$ are mean-independent of $X$: $E[v_i \mid x_i] = 0$.

Assumption LS3 (SD): The disturbances are homoskedastic and not autocorrelated: $E[VV' \mid X] = \sigma^2 I_N$.

Assumption LS4 (NS): The matrix $X$ is non-stochastic.
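As a minimal sketch of the OLS formulas above, the following NumPy snippet computes $\hat\beta = (X'X)^{-1}X'Y$ and $s^2 = (N-K)^{-1}\hat V'\hat V$ on simulated data. The data-generating values (`N`, `K`, `beta_true`, the error scale) are hypothetical and chosen only for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical simulated data: N observations, K regressors (incl. a constant).
N, K = 500, 3
X = np.column_stack([np.ones(N), rng.normal(size=(N, K - 1))])
beta_true = np.array([1.0, 2.0, -0.5])
y = X @ beta_true + rng.normal(scale=1.5, size=N)

# beta_hat = (X'X)^{-1} X'Y, obtained by solving the normal equations
beta_hat = np.linalg.solve(X.T @ X, X.T @ y)

# s^2 = (N - K)^{-1} V_hat' V_hat, with V_hat the OLS residuals
resid = y - X @ beta_hat
s2 = resid @ resid / (N - K)

# Estimated covariance matrix of beta_hat: s^2 (X'X)^{-1}
cov_beta = s2 * np.linalg.inv(X.T @ X)

print("beta_hat:", beta_hat)
print("s^2:", s2)
```

Solving the normal equations with `np.linalg.solve` avoids forming the explicit inverse for the point estimate; the inverse is only computed where the covariance matrix itself is needed.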

Under Assumptions LS1-LS3, the OLS estimators have the following exact finite sample properties.

1. Unbiasedness:

$$E[\hat\beta \mid X] = E[\hat\beta] = \beta, \qquad E[s^2 \mid X] = E[s^2] = \sigma^2.$$

2. Gauss-Markov Theorem: $\hat\beta$ is the best linear unbiased estimator of $\beta$, with

$$\operatorname{Var}[\hat\beta \mid X] = \sigma^2 (X'X)^{-1}, \qquad \operatorname{Var}[\hat\beta] = \sigma^2 E\left[(X'X)^{-1}\right].$$

The following assumption allows inference in small samples.

Assumption LS5 (SD): The disturbances are normally distributed: $V \mid X \sim N(0, \sigma^2 I_N)$.

Under assumptions LS1-LS5 it holds that

$$\hat\beta \mid X \sim N\left(\beta, \sigma^2 (X'X)^{-1}\right).$$

2.1.2 Large Sample Properties

To obtain asymptotic properties, the matrix of explanatory variables needs to be well behaved, and the condition

$$\operatorname*{plim}_{N\to\infty} N^{-1} X'X = Q_{xx}, \tag{2}$$

with $Q_{xx}$ a finite, positive definite matrix, is usually imposed. This assumption, however, is stronger than necessary and can be replaced by the Grenander conditions in case $X$ includes polynomial time series or trending variables. We also replace assumption LS4 about non-stochastic $X$ with the following one.

Assumption LS6 (IS): The sequence $\{(x_i, v_i)\}_{i=1}^{N}$ is IID.

Under the regularity condition (2), and under assumptions LS2, LS3 and LS6, it holds that

$$\operatorname*{plim}_{N\to\infty} \hat\beta = \beta,$$

$$\sqrt{N}\,(\hat\beta - \beta) \xrightarrow{D} N\left(0, \sigma^2 Q_{xx}^{-1}\right).$$

The most important conclusion is that, given good behavior of the regressors, asymptotic normality of the estimator does not depend on the normality of the disturbances.

2.2 Instrumental Variables (IV)

Zero correlation between $x_i$ and $v_i$ was the key assumption to guarantee the desirable properties of the OLS estimator. Here this assumption is relaxed, but the existence of an $L \times 1$ vector $z_i$ of variables that are correlated with $x_i$, but not with $v_i$, is assumed. In particular, the following assumptions are made.

Assumption IV1: The sequence $\{(x_i, z_i, v_i)\}_{i=1}^{N}$ is IID.

Assumption IV2: The disturbances $V$ are mean-independent of $Z$: $E[v_i \mid z_i] = 0$.

Under the additional regularity conditions

$$\operatorname*{plim}_{N\to\infty} N^{-1} Z'Z = Q_{zz}, \qquad \operatorname*{plim}_{N\to\infty} N^{-1} Z'X = Q_{zx},$$

with $Q_{zz}$ a finite, positive definite matrix and $Q_{zx}$ a finite matrix of full column rank $K$, the instrumental variable estimator

$$\hat\beta_{IV} = \left(X'Z(Z'Z)^{-1}Z'X\right)^{-1} X'Z(Z'Z)^{-1}Z'Y$$

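has the following asymptotic properties:

$$\operatorname*{plim}_{N\to\infty} \hat\beta_{IV} = \beta,$$

$$\sqrt{N}\,(\hat\beta_{IV} - \beta) \xrightarrow{D} N\left(0, \sigma^2 \left(Q_{xz} Q_{zz}^{-1} Q_{zx}\right)^{-1}\right), \qquad Q_{xz} = Q_{zx}',$$

which reduces to $\sigma^2 Q_{zx}^{-1} Q_{zz} Q_{xz}^{-1}$ in the just-identified case $L = K$.

2.3 Generalized Least Squares (GLS)

Consider again the general linear model

$$Y = X\beta + V,$$

but assume now that the disturbances violate the homoskedasticity and uncorrelatedness assumptions, i.e.

$$E[VV' \mid X] = \sigma^2 \Omega,$$

where $\Omega$ is a positive definite matrix. If now both $\operatorname{plim}_{N\to\infty} N^{-1}X'X$ and $\operatorname{plim}_{N\to\infty} N^{-1}X'\Omega X$ are positive definite matrices, then

$$\operatorname*{plim}_{N\to\infty} \hat\beta_{OLS} = \beta.$$

If $\Omega$ were known, the best linear unbiased estimator (BLUE) of $\beta$ would be given by

$$\hat\beta_{GLS} = \{X'\Omega^{-1}X\}^{-1}\{X'\Omega^{-1}Y\}.$$

We can now first obtain $\hat\beta_{OLS}$, and then, using the OLS residuals, estimate the unknown parameters of $\Omega$. We can then decompose $\Omega$ into its eigenvectors and eigenvalues,

$$\Omega = E \Lambda E',$$

where the columns of $E$ are the characteristic vectors of $\Omega$, and $\Lambda$ is a diagonal matrix containing the characteristic roots $\lambda_i$ of $\Omega$. Let now $\Lambda^{-1/2}$ be the diagonal matrix containing $\lambda_i^{-1/2}$ and define $P = \Lambda^{-1/2}E'$. Pre-multiplication of the model by $P$,

$$PY = PX\beta + PV,$$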

results now in a transformed model for which the disturbances are again homoskedastic and not autocorrelated:

$$E[PVV'P' \mid X] = \sigma^2 P\Omega P' = \sigma^2 \Lambda^{-1/2}E'E\Lambda E'E\Lambda^{-1/2} = \sigma^2 I,$$

so the classical regression model applies to the transformed model and thus

$$\hat\beta_{GLS} = \{X'P'PX\}^{-1}\{X'P'PY\} = \{X'\Omega^{-1}X\}^{-1}\{X'\Omega^{-1}Y\} \overset{a}{\sim} N\left(\beta, \sigma^2 (X'\Omega^{-1}X)^{-1}\right).$$

The feasible GLS estimator substitutes $\hat\Omega$ for the unknown $\Omega$. Note that we only need a consistent estimator of $\Omega$ in order for the efficiency properties of $\hat\beta_{GLS}$ to carry over to $\hat\beta_{FGLS}$.

2.4 (Generalized) Method of Moments (GMM)

Using GMM, a number of orthogonality restrictions are formulated. The parameter estimates satisfy these restrictions as closely as possible. Consider the model

$$y_i = f(x_i, \beta) + u_i, \qquad z_i = g(x_i), \qquad E[z_i u_i] = 0,$$

where $\beta$ is a $K \times 1$ vector of parameters and $z_i$ is an $L \times 1$ vector of instruments. The sample counterparts of the population moment restrictions $E[z_i u_i] = 0$ are given by

$$c_N(b) = N^{-1}\sum_{i=1}^{N} z_i\{y_i - f(x_i, b)\} = 0.$$

In more general terms, the moment equations have the general form

$$E[\psi(w, \beta)] = 0,$$

and their sample counterparts are given by

$$c_N(b) = N^{-1}\sum_{i=1}^{N} \psi(w_i, b),$$

with $w_i = (y_i, x_i', z_i')'$ with double entries removed. The GMM estimator of $\beta$ is defined by

$$\hat\beta_{GMM} = \arg\min_b J_N(b) = \arg\min_b\, c_N(b)' W_N\, c_N(b),$$

where $W_N$ is some weight matrix. In fact, a set of moment conditions defines a whole family of estimators $\hat\beta_{GMM}(W_N)$, depending on the actual choice of the weight matrix $W_N$. When $K = L$, the system is just identified and $\hat\beta_{GMM}$ is the unique solution to $c_N(b) = 0$.

Properties

Under certain conditions (Hansen, 1982), we have:

1. Strong consistency: $\hat\beta_{GMM} \xrightarrow{as} \beta$, for $N \to \infty$;
2. Asymptotic normality:

$$\sqrt{N}\,(\hat\beta_{GMM} - \beta) \xrightarrow{D} N\left(0, \Sigma_{\sqrt{N}\hat\beta_{GMM}}\right),$$

where

$$\Sigma_{\sqrt{N}\hat\beta_{GMM}} = (D'WD)^{-1} D'WSWD\, (D'WD)^{-1},$$

$$D(\beta) = \operatorname*{plim}_{N\to\infty} \left.\frac{\partial c_N(b)}{\partial b'}\right|_{b=\beta}, \qquad W = \operatorname*{plim}_{N\to\infty} W_N, \qquad S = N^{-1}\sum_{i=1}^{N} E\left[z_i u_i^2 z_i'\right].$$
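For the linear case $f(x_i, \beta) = x_i'\beta$, minimizing $c_N(b)'W_N c_N(b)$ has the closed form $(X'ZW_NZ'X)^{-1}X'ZW_NZ'Y$. The sketch below uses that closed form in a two-step procedure: a first step with $W_N = (Z'Z)^{-1}$, followed by a second step whose weight matrix is the inverse of an estimate of $S$ built from first-step residuals. The function names (`linear_gmm`, `two_step_gmm`) and the simulated design are hypothetical and serve only as an illustration.

```python
import numpy as np

def linear_gmm(y, X, Z, W):
    """Closed-form linear GMM estimator (X'Z W Z'X)^{-1} X'Z W Z'y."""
    XZ = X.T @ Z
    return np.linalg.solve(XZ @ W @ XZ.T, XZ @ W @ (Z.T @ y))

def two_step_gmm(y, X, Z):
    """Two-step GMM: first step W = (Z'Z)^{-1}, second step W = S_hat^{-1}."""
    N = len(y)
    b1 = linear_gmm(y, X, Z, np.linalg.inv(Z.T @ Z))
    u1 = y - X @ b1
    # S_hat = N^{-1} sum_i z_i u_i^2 z_i', using first-step residuals
    S_hat = (Z * u1[:, None] ** 2).T @ Z / N
    W2 = np.linalg.inv(S_hat)
    b2 = linear_gmm(y, X, Z, W2)
    # Minimized criterion N * c_N(b2)' S_hat^{-1} c_N(b2)
    c = Z.T @ (y - X @ b2) / N
    J = N * c @ W2 @ c
    return b1, b2, J

# Hypothetical design: one endogenous regressor x, two excluded instruments.
rng = np.random.default_rng(1)
N = 5000
z = rng.normal(size=(N, 2))
e = rng.normal(size=N)
x = 0.8 * z[:, 0] + 0.5 * z[:, 1] + e
u = 0.5 * e + rng.normal(size=N)          # correlated with x through e
y = 1.0 + 2.0 * x + u

X = np.column_stack([np.ones(N), x])
Z = np.column_stack([np.ones(N), z])
b1, b2, J = two_step_gmm(y, X, Z)
print("first step:", b1, "second step:", b2, "N*J:", J)
```

With $L = 3$ instruments and $K = 2$ parameters the model is over-identified by one restriction, so the minimized criterion is generally not zero.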

In the special case that $f(x_i, \beta)$ is linear in $\beta$, there is a closed form expression for $\hat\beta_{GMM}$,

$$\hat\beta_{GMM} = (X'Z W_N Z'X)^{-1} X'Z W_N Z'Y,$$

and $D$ is estimated (up to sign) by $N^{-1}Z'X$.

Choice of $W_N$

1. The asymptotically optimal GMM estimator minimizes $\Sigma_{\sqrt{N}\hat\beta_{GMM}}$ as a function of $W_N$. It can be shown that this optimum is reached for any choice of $W_N$ for which $W = S^{-1}$, resulting in an asymptotic covariance for $\hat\beta_{OGMM}$ of

$$\Sigma_{\sqrt{N}\hat\beta_{OGMM}} = (D'S^{-1}D)^{-1}.$$

This optimal estimator is typically computed in two steps: in a first step, a consistent estimate of $\beta$ is obtained (by OLS, IV, GMM, ...), which allows estimation of $S^{-1}$ by

$$\hat S^{-1} = \left(N^{-1}\sum_{i=1}^{N} z_i \hat u_i^2 z_i'\right)^{-1}.$$

In the second step the asymptotically optimal GMM estimator is computed. While these asymptotic results only require the first step estimator to be consistent, finite sample properties improve when the first step weight matrix is closer to the optimal one.

2. OLS is identical to exactly identified GMM with $W_N = I_K$ and $z_i = x_i$.

3. IV is a GMM estimator with $W_N = (Z'Z)^{-1}$, a weight matrix that is optimal in case $u_i \sim \text{IID}(0, \sigma^2)$.

2.5 Maximum Likelihood (ML)

2.5.1 Estimator

Suppose the RVs $x_1, \ldots, x_N$ have a joint density function $f(x_1, \ldots, x_N \mid \theta)$, which can also be considered a function of the parameter vector $\theta$. In this capacity it is called the likelihood

$$L(\theta) = f(x_1, \ldots, x_N \mid \theta).$$

The maximum likelihood estimator (MLE) is that value of $\theta$ that makes the observed data most probable:

$$\hat\theta = \arg\max_\theta L(\theta).$$

If the $x_i$ are assumed to be IID, their joint density is the product of the marginal densities and the likelihood simplifies to

$$L(\theta) = \prod_{i=1}^{N} f(x_i \mid \theta),$$

maximization of which is usually simplified by taking the natural logarithm. Since $\ln(\cdot)$ is a monotonic function, the maximum likelihood estimator also maximizes the log-likelihood

$$l(\theta) = \sum_{i=1}^{N} \ln f(x_i \mid \theta).$$

The information matrix of $\theta$ is defined as

$$I(\theta) = -E\left[\frac{\partial^2 l(\theta)}{\partial\theta\,\partial\theta'}\right].$$

Under correct specification of the model it is equal to

$$E\left[\left(\frac{\partial l(\theta)}{\partial\theta}\right)\left(\frac{\partial l(\theta)}{\partial\theta}\right)'\right].$$

2.5.2 Regularity conditions

This informal treatment of conditions is from Greene (2000, p. 127).

1. The first three derivatives of $l(x_i, \theta)$ with respect to $\theta$ are finite for all values of $\theta$ and for almost all $x_i$, which ensures the finite variance of $\partial l(\theta)/\partial\theta$ (in addition to the existence of a second-degree Taylor approximation of $l(x_i, \theta)$).
2. The conditions for the existence of $E[\partial l(\theta)/\partial\theta]$ and $E[I(\theta)]$ are met.
3. For all $\theta$, $\left|\partial^3 l(x_i, \theta)/\partial\theta_k\,\partial\theta_l\,\partial\theta_m\right|$ is less than some function $g(x_i)$ with finite expectation.

By means of the above conditions, the following properties are obtained:

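1. $l(x_i, \theta)$, $g_i = \partial l(x_i, \theta)/\partial\theta$ and $H_i = \partial^2 l(x_i, \theta)/\partial\theta\,\partial\theta'$, for $i = 1, \ldots, N$, are all random samples of random variables.
2. $E[g_i] = 0$.
3. $\operatorname{Var}[g_i] = -E[H_i]$.

These should facilitate the intuition behind the following properties.

2.5.3 Properties

Under the regularity conditions stated in sub-subsection 2.5.2, the MLE has the following asymptotic properties.

1. Consistency: $\operatorname{plim}_{N\to\infty} \hat\theta_{ML} = \theta$.
2. Normality: $\hat\theta_{ML} \overset{a}{\sim} N\left(\theta, I(\theta)^{-1}\right)$, with

$$I(\theta) = -E\left[\frac{\partial^2 \ln L(\theta)}{\partial\theta\,\partial\theta'}\right] = -E\left[\frac{\partial^2 l(\theta)}{\partial\theta\,\partial\theta'}\right].$$

3. Efficiency: $\hat\theta_{ML}$ reaches the Cramér-Rao lower bound for consistent estimators.
4. Invariance: the MLE of $\gamma = g(\theta)$ is given by $g(\hat\theta_{ML})$.

2.6 Maximum Entropy (ME)

2.6.1 Entropy

Setting: Consider an experiment with the possible outcomes $y_1, \ldots, y_N$, from which we want to retrieve the unknown and unobservable probabilities $p_1, \ldots, p_K$,

$$\sum_{k=1}^{K} p_k = 1, \qquad p_k \ge 0,$$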

which are assumed to represent the data generating process (DGP). We assume, in addition, that $y = Xp$, where $X$ is an $(N \times K)$ non-invertible matrix with $K > N$ (Golan, Judge, and Miller, 1996).

Prior to execution, there is uncertainty about the outcome. Entropy is a measure of this uncertainty. It should satisfy the following requirements (Kapur, 1989, p. 2):

1. It should be a continuous function $E_K(p_1, \ldots, p_K)$ of $p_1, \ldots, p_K$, invariant to permutation of its arguments.
2. Addition of an impossible outcome: $E_{K+1}(p_1, \ldots, p_K, 0) = E_K(p_1, \ldots, p_K)$.
3. Certain outcome: $E_K(1, 0_{K-1}) = 0$.
4. Maximum uncertainty: $\max_{p_1, \ldots, p_K} E_K(p_1, \ldots, p_K) = E_K(K^{-1}\mathbf{1}_K)$, i.e. the maximum is attained at the uniform distribution.
5. Behavior for increasing $K$: $\max E_L(\cdot) > \max E_K(\cdot)$ for $L > K$.
6. For independent probability distributions: $E_{KL}(p \otimes q) = E_K(p_1, \ldots, p_K) + E_L(q_1, \ldots, q_L)$.

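Shannon's (1948) measure of uncertainty is

$$E_K = -\sum_{k=1}^{K} p_k \ln p_k,$$

where we set $p \ln p = 0$ if $p = 0$.

2.6.2 Maximum entropy principle

The maximum entropy principle aims to give as broad a distribution as possible, subject to satisfaction of the constraints, which can be formalized as

$$\max_{p_1, \ldots, p_K} E_K(p_1, \ldots, p_K) \quad \text{s.t.} \quad y = Xp, \quad \mathbf{1}_K'p = 1, \quad p \ge 0.$$

This will result in the vector $p$ that can generate the greatest number of outcomes consistent with the data. The analytical solution of this problem can be obtained by maximizing the Lagrangian function

$$\mathcal{L} = -p'\ln p + \lambda'(y - Xp) + \mu(1 - \mathbf{1}_K'p),$$

with optimality conditions

$$\frac{\partial\mathcal{L}}{\partial p} = -\ln p - \mathbf{1}_K - X'\lambda - \mu\mathbf{1}_K = 0_K,$$

$$\frac{\partial\mathcal{L}}{\partial\lambda} = y - Xp = 0,$$

$$\frac{\partial\mathcal{L}}{\partial\mu} = 1 - \mathbf{1}_K'p = 0.$$

We can solve for $\hat p$ as a function of $\hat\lambda$ as

$$\hat p_k = \frac{\exp(-x_k'\hat\lambda)}{\sum_{j=1}^{K}\exp(-x_j'\hat\lambda)},$$

where $x_k$ denotes the $k$-th column of $X$. Because $\hat p$ is a function of $\hat\lambda$, the maximum entropy distribution does not have a closed form solution. We must use numerical optimization techniques to compute $\hat p$.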
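One way to carry out the numerical step is to solve for $\hat\lambda$ from the constraint $y = Xp(\lambda)$ by Newton's method, which is equivalent to minimizing the convex dual function $\lambda'y + \ln\sum_k \exp(-x_k'\lambda)$. The sketch below does this for a classic illustrative case that is not taken from the text: a six-sided die whose only known moment is a mean of 4.5. The function name `maxent_probabilities` and the plain (undamped) Newton iteration are assumptions of this sketch.

```python
import numpy as np

def probabilities(X, lam):
    """p_k proportional to exp(-x_k' lam), normalized to sum to one."""
    a = -X.T @ lam
    w = np.exp(a - a.max())               # shift for numerical stability
    return w / w.sum()

def maxent_probabilities(X, y, n_iter=50, tol=1e-12):
    """Solve y = X p(lam) for lam by Newton's method on the dual problem."""
    lam = np.zeros(X.shape[0])
    for _ in range(n_iter):
        p = probabilities(X, lam)
        grad = y - X @ p                   # dual gradient = constraint violation
        if np.max(np.abs(grad)) < tol:
            break
        Xp = X @ p
        hess = (X * p) @ X.T - np.outer(Xp, Xp)   # covariance of x under p
        lam = lam - np.linalg.solve(hess, grad)
    return probabilities(X, lam), lam

# Hypothetical example: K = 6 outcomes (faces 1..6), N = 1 moment (mean 4.5).
X = np.arange(1, 7, dtype=float).reshape(1, 6)
y = np.array([4.5])
p_hat, lam_hat = maxent_probabilities(X, y)
print("p_hat:", np.round(p_hat, 4))
print("implied mean:", (X @ p_hat)[0])
```

Because the required mean exceeds the uniform mean of 3.5, the solution puts increasing weight on higher faces; with a required mean of exactly 3.5 the iteration stops immediately at the uniform distribution.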

For continuous probability distributions we maximize

$$-\int \ln f(x)\, f(x)\, dx,$$

subject to

$$\int f(x)\, dx = 1, \qquad \int g_r(x) f(x)\, dx = \bar g_r,$$

where the functions $g_r(\cdot)$ are for example $x$, $(x - \mu_x)^2$, and the $\bar g_r$ are their observed sample counterparts.

2.6.3 Use

- More unknowns than observations
- Non-invertibility of $X$, due to linearly dependent columns
- Inconsistent data

Traditional estimation procedures can lead to:

- Arbitrarily fixed parameters
- Undefined solutions
- Unstable estimates

3 Some Common Tests

The most commonly used test procedures are reviewed in the next subsections.

3.1 Likelihood ratio test

Let $\theta$ be a vector of parameters to be estimated (by ML) and let $H_0$ be a restriction of some kind on $\theta$, restricting the parameter vector by $c$ degrees of freedom. (For example, $\theta_1 = 0$ is a restriction of degree 1; $\theta_1 = \theta_2 = 0$, on the other hand, is a restriction of degree 2, while $\theta_1 = \theta_2$ is a restriction of degree only 1.) Let $\hat\theta_U$ and $\hat\theta_R$ be the ML estimates of the unrestricted and the restricted model respectively, and let $\hat L_U$ and $\hat L_R$ be the corresponding likelihood functions, evaluated at their corresponding estimates. The Likelihood Ratio (LR) is defined by

$$\lambda = \frac{\hat L_R}{\hat L_U}.$$

Under the regularity conditions given in sub-subsection 2.5.2 and under $H_0$ (so that $\hat\theta_R$ and $\hat\theta_U$ estimate the same parameter vector),

$$-2\ln\lambda = -2\left\{\ln\hat L_R - \ln\hat L_U\right\} \xrightarrow{D} \chi^2_c.$$

Note that we need to estimate both the restricted and the unrestricted model in order to be able to perform this test.

3.2 Lagrange multiplier (LM) test

The LM (or efficient score) test is based purely on the restricted model, i.e. on estimation under $H_0$. It is based on the maximization of $l(\theta)$, subject to the set of $c$ constraints $c(\theta) = q$. The Lagrangian function for this problem is

$$l^*(\theta) = l(\theta) + \lambda'[c(\theta) - q],$$

with $\lambda$ a $c \times 1$ vector of Lagrange multipliers. The constrained solution $\hat\theta_R$ is the root of

$$\frac{\partial l^*(\theta)}{\partial\theta} = \frac{\partial l(\theta)}{\partial\theta} + \left(\frac{\partial c(\theta)}{\partial\theta'}\right)'\lambda = 0,$$

$$\frac{\partial l^*(\theta)}{\partial\lambda} = c(\theta) - q = 0.$$

If the restrictions are valid, $\hat\theta_R \approx \hat\theta_U$, or $\lambda \approx 0_c$, and we can thus base a test on the vector

$$\left.\frac{\partial l(\theta)}{\partial\theta}\right|_{\theta = \hat\theta_R} \approx 0.$$

To that end define the LM statistic as

$$LM = \left(\left.\frac{\partial l(\theta)}{\partial\theta}\right|_{\theta=\hat\theta_R}\right)' I(\hat\theta_R)^{-1} \left(\left.\frac{\partial l(\theta)}{\partial\theta}\right|_{\theta=\hat\theta_R}\right).$$

Under $H_0$,

$$LM \xrightarrow{D} \chi^2_c.$$

Note that we only need to estimate the restricted model in order to be able to perform this test.

3.3 Wald test

As before, let $\hat\theta_U$ be the ML estimate of the unrestricted model. We can now test the set of restrictions $c(\theta) = q$ by means of the Wald statistic

$$W = \left(c(\hat\theta_U) - q\right)' \left[\left(\left.\frac{\partial c(\theta)}{\partial\theta'}\right|_{\theta=\hat\theta_U}\right) I(\hat\theta_U)^{-1} \left(\left.\frac{\partial c(\theta)}{\partial\theta'}\right|_{\theta=\hat\theta_U}\right)'\right]^{-1} \left(c(\hat\theta_U) - q\right).$$

Under $H_0$,

$$W \xrightarrow{D} \chi^2_c.$$

Note that we only need to estimate the unrestricted model in order to be able to perform this test.

3.4 Hausman Test

Hausman (1978) suggests testing the difference between two estimators, $\hat\beta_E$ being consistent and efficient under $H_0$ but inconsistent under $H_A$, and $\hat\beta_C$ being consistent under both $H_0$ and $H_A$. (Although this principle was used previously by Durbin (1954) and Wu (1973), Hausman (1978) was the first to formalize it in such a general way; see MacKinnon, 1992.) Because an inefficient estimator like $\hat\beta_C$ can always be written as the sum of an efficient estimator like $\hat\beta_E$ and a random vector that is uncorrelated with $\hat\beta_E$, the covariance matrix of $\hat\beta_C - \hat\beta_E$ is simply the difference of their covariance matrices. The proposed test is thus

$$H = \left(\hat\beta_C - \hat\beta_E\right)' \left\{\operatorname{Var}[\hat\beta_C] - \operatorname{Var}[\hat\beta_E]\right\}^{-1} \left(\hat\beta_C - \hat\beta_E\right).$$

Under $H_0$, $H \xrightarrow{D} \chi^2_c$, with $c$ the degrees of freedom of the restriction.

3.5 Sargan-Hansen J-Test

GMM estimators are motivated by setting sample analogues of population orthogonality restrictions as close to zero as possible. If the model is just identified, we achieve this exactly, and nothing is left to test. If the model is over-identified, we can test whether the over-identifying restrictions are set close enough to zero to be consistent with their validity, when evaluated at the optimal GMM parameter estimates. If they are small enough, we do not reject the validity of the moment conditions used. Otherwise we reject. Very loosely, this is like testing for correlation between the model residuals and (a subset of) the instruments used.

In the model above,

$$y_i = f(x_i, \beta) + u_i, \qquad z_i = g(x_i), \qquad E[z_i u_i] = 0,$$

with $\beta$ a $K \times 1$ vector of parameters and $z_i$ an $L \times 1$ vector of instruments, the GMM estimator of $\beta$ was defined as

$$\hat\beta_{GMM} = \arg\min_b J_N(b).$$

The main result of interest here is that, under $H_0: E[z_i u_i] = 0$ (the null that all instruments are valid), the minimized optimal GMM criterion $N J_N(\hat\beta_{OGMM})$ is asymptotically distributed as a central chi-square distribution with $L - K$ degrees of freedom,

$$N J_N(\hat\beta_{OGMM}) = N\, c_N(\hat\beta_{OGMM})'\, \hat S^{-1} c_N(\hat\beta_{OGMM}) \xrightarrow{D} \chi^2_{L-K},$$

with $\hat\beta_{OGMM}$ an optimal estimator and $\hat S$ a consistent estimate of $S$.

The intuition of the argument is as follows. Under the conditions for $\hat\beta_{OGMM}$ to be asymptotically normally distributed, it holds that

$$\sqrt{N}\,\hat C' c_N(\beta) \xrightarrow{D} N(0, I_L),$$

where $\hat S^{-1} = \hat C\hat C'$. Defining $\hat G = \hat C'\hat D$ we have that

$$\sqrt{N}\,(\hat\beta_{OGMM} - \beta) = -(\hat G'\hat G)^{-1}\hat G'\sqrt{N}\,\hat C' c_N(\beta) + o_p(1),$$

and, using a first-order expansion for $c_N(\hat\beta_{OGMM})$ around $\beta$,

$$h \equiv \sqrt{N}\,\hat C' c_N(\hat\beta_{OGMM}) = \sqrt{N}\,\hat C' c_N(\beta) + \hat C'\hat D\,\sqrt{N}\,(\hat\beta_{OGMM} - \beta) + o_p(1) = \left\{I_L - \hat G(\hat G'\hat G)^{-1}\hat G'\right\}\sqrt{N}\,\hat C' c_N(\beta) + o_p(1).$$

Since the limit of $\left\{I_L - \hat G(\hat G'\hat G)^{-1}\hat G'\right\}$ is idempotent and has rank $L - K$,

$$h'h \xrightarrow{D} \chi^2_{L-K}.$$

When the value of the Sargan/Hansen test statistic is too large, relative to critical values of the appropriate $\chi^2_{L-K}$ distribution, reject the validity of the set of moment conditions used.

Consider now a partition of the $R = R_A + R_B$ moment conditions as

$$E[\psi_A(w, \beta)] = 0, \qquad E[\psi_B(w, \beta)] = 0,$$

in $R_A$ and $R_B$ moment conditions, where $\beta$ is a $K \times 1$ vector of parameters, with $R_A > K$. We wish to test the restrictions $E[\psi_B(w, \beta)] = 0$, taking $E[\psi_A(w, \beta)] = 0$ as given. Optimal GMM using the restricted set of moment conditions results in $\hat\beta_A$, while $\hat\beta$ is the estimator when all moment conditions are used. Similarly, $N J_{NA}(\hat\beta_A)$ is the minimized optimal GMM criterion using only the restricted moment set, while $N J_N(\hat\beta)$ is the minimized optimal GMM criterion using all moment conditions. From the previous result we know that

$$N J_{NA}(\hat\beta_A) \xrightarrow{D} \chi^2_{R_A - K}, \qquad N J_N(\hat\beta) \xrightarrow{D} \chi^2_{R - K}.$$

It can now also be shown (Blundell and Bond, 1998) that

$$J_d = N J_N(\hat\beta) - N J_{NA}(\hat\beta_A) \xrightarrow{D} \chi^2_{R_B}.$$

Furthermore, $J_d$ is asymptotically independent of $N J_{NA}(\hat\beta_A)$.

4 Models, parameters of interest and the incidental parameters problem

4.1 Review of some nonlinear models

4.1.1 Discrete choice

Consider a binary outcome dependent variable $y_i \in \{0, 1\}$, which is thought to depend on some covariates $x_i$, such that

$$\Pr[y_i = 1 \mid x_i] = F(x_i'\beta).$$

The log-likelihood for such a model is given by

$$l(\beta) = \sum_{i=1}^{N}\left\{y_i \ln F(x_i'\beta) + (1 - y_i)\ln\left[1 - F(x_i'\beta)\right]\right\},$$

with gradient with respect to $\beta$ given by

$$\frac{\partial l(\beta)}{\partial\beta} = \sum_{i=1}^{N}\left[\frac{y_i}{F(x_i'\beta)} - \frac{1 - y_i}{1 - F(x_i'\beta)}\right]\frac{\partial F(x_i'\beta)}{\partial (x_i'\beta)}\, x_i.$$

To model the above relationship sensibly, the function $F(\cdot)$ should be monotonically increasing from 0 to 1. Common choices are $F(x_i'\beta) = \Phi(x_i'\beta)$ in the probit model and $F(x_i'\beta) = \dfrac{\exp(x_i'\beta)}{1 + \exp(x_i'\beta)}$ in the logit model.

4.1.2 Censoring

A general formulation of the censored regression or tobit model in terms of an index function is

$$y_i^* = \alpha'x_i + \varepsilon_i, \qquad y_i = y_i^*\, 1(y_i^* > 0).$$

Assuming a normal distribution for $\varepsilon$, the log-likelihood of the above model is given by

$$l(\alpha) = \sum_{y_i > 0} -\frac{1}{2}\left[\ln(2\pi) + \ln\sigma^2 + \left(\frac{y_i - \alpha'x_i}{\sigma}\right)^2\right] + \sum_{y_i = 0}\ln\left(1 - \Phi\left(\frac{\alpha'x_i}{\sigma}\right)\right),$$

where the first term is identical to the log-likelihood of a linear model, and the second term to the log-likelihood of a probit model. It is a mixture of discrete and continuous distributions.

4.1.3 Selection

Consider the following first-stage model, where the selection is determined by the variable

$$z_i = 1(z_i^* > 0), \qquad z_i^* = \alpha'w_i + u_i,$$

and the equation of interest is given by

$$y_i = \beta'x_i + \varepsilon_i.$$

Assuming

$$\begin{pmatrix} u \\ \varepsilon \end{pmatrix} \sim N\left(0; \begin{pmatrix} \sigma_u^2 & \rho\sigma_u\sigma_\varepsilon \\ \rho\sigma_u\sigma_\varepsilon & \sigma_\varepsilon^2 \end{pmatrix}\right),$$

we have for the observed outcomes

$$E[y_i \mid z_i = 1] = \beta'x_i + \rho\sigma_\varepsilon\lambda,$$

where $\lambda = \phi(q)/\Phi(q)$, with $q = \alpha'w_i/\sigma_u$.

Consider for a moment the nonlinear specification

$$y_i = f_1(x_i) + \varepsilon_i, \qquad E[\varepsilon_i \mid x_i] = 0,$$

$$z_i = 1\left(z_i^* = f_2(x_i) + u_i > 0\right),$$

which implies that

$$E[y_i \mid x_i] = f_1(x_i),$$

$$E[y_i \mid x_i, z_i = 1] = f_1(x_i) + E[\varepsilon_i \mid x_i, f_2(x_i) + u_i > 0].$$

The crucial point for identification of $f_1(x_i)$ is that

$$E[\varepsilon_i \mid x_i, f_2(x_i) + u_i > 0] = E[\varepsilon_i \mid f_2(x_i), f_2(x_i) + u_i > 0],$$

i.e. the conditional expectation of $\varepsilon_i$ depends on $x_i$ through $f_2(x_i)$ only.

4.1.4 Count data

A count data model captures the number of occurrences of an event per period. A popular model is the Poisson regression model, in which the probability of each outcome is given by

$$\Pr[y_i = k] = \frac{\exp(-\lambda_i)\lambda_i^k}{k!}, \qquad k \in \mathbb{N},$$

and the expected number of events per period is

$$E[y_i] = \lambda_i,$$

which coincides with the variance of the number of events per period. The $\lambda_i$ can be written in terms of covariates, which commonly takes the form

$$\lambda_i = \exp(\gamma'x_i).$$

The log-likelihood function is given by

$$l(\gamma; y) = \sum_{i=1}^{N}\left\{-\exp(\gamma'x_i) + y_i\gamma'x_i - \ln(y_i!)\right\},$$

with gradient

$$\frac{\partial l(\gamma)}{\partial\gamma} = \sum_{i=1}^{N}(y_i - \lambda_i)x_i$$

and Hessian

$$\frac{\partial^2 l(\gamma)}{\partial\gamma\,\partial\gamma'} = -\sum_{i=1}^{N}\lambda_i x_i x_i'.$$

The Poisson distribution can be generalized by introducing an unobserved effect into the conditional mean,

$$E[y_i \mid x_i, \varepsilon_i] = \exp(\gamma'x_i + \varepsilon_i) = \lambda_i u_i = \mu_i, \qquad u_i = \exp(\varepsilon_i).$$

Conditional on $x_i$ and $u_i$, $y_i$ is Poisson distributed:

$$\Pr[y_i = k \mid x_i, \varepsilon_i] = \frac{\exp(-\mu_i)\mu_i^k}{k!}.$$
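The gradient and Hessian of the Poisson log-likelihood given above are all that is needed for a Newton-Raphson iteration, and because the Hessian is negative definite the log-likelihood is globally concave. The following is a minimal sketch on simulated data; the function name `poisson_newton`, the starting value of zero and the simulated design are illustrative assumptions.

```python
import numpy as np

def poisson_newton(y, X, n_iter=25, tol=1e-10):
    """Poisson regression by Newton-Raphson, lambda_i = exp(gamma' x_i)."""
    gamma = np.zeros(X.shape[1])
    for _ in range(n_iter):
        lam = np.exp(X @ gamma)
        grad = X.T @ (y - lam)                 # sum_i (y_i - lambda_i) x_i
        hess = -(X * lam[:, None]).T @ X       # -sum_i lambda_i x_i x_i'
        step = np.linalg.solve(hess, grad)
        gamma = gamma - step                   # Newton update
        if np.max(np.abs(step)) < tol:
            break
    return gamma

# Hypothetical simulated data
rng = np.random.default_rng(2)
N = 5000
X = np.column_stack([np.ones(N), rng.normal(size=N)])
gamma_true = np.array([0.5, -0.3])
y = rng.poisson(np.exp(X @ gamma_true))

gamma_hat = poisson_newton(y, X)
score = X.T @ (y - np.exp(X @ gamma_hat))
print("gamma_hat:", gamma_hat)
print("score at gamma_hat:", score)
```

At the solution the score is numerically zero, which is the first-order condition of the maximum likelihood problem.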

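The distribution conditional only on $x_i$ is obtained by integrating out $u_i$ (that is, $\Pr[y_i \mid x_i] = E_{u_i}\left[\Pr[y_i \mid x_i, u_i]\right]$):

$$\Pr[y_i = k \mid x_i] = \int_0^\infty \frac{\exp(-\lambda_i u_i)(\lambda_i u_i)^k}{k!}\, g(u_i)\, du_i.$$

Assuming now that $u_i$ follows a $\Gamma$-distribution, i.e.

$$g(u_i) = \frac{\beta^\alpha}{\Gamma(\alpha)}\exp(-\beta u_i)\, u_i^{\alpha-1},$$

and normalizing $E[u_i] = \alpha/\beta = 1$, we obtain $\Pr[y_i \mid x_i]$ as

$$\begin{aligned}
\Pr[y_i \mid x_i] &= \frac{\lambda_i^{y_i}\alpha^\alpha}{\Gamma(y_i+1)\Gamma(\alpha)}\int_0^\infty \exp(-[\lambda_i+\alpha]u_i)\, u_i^{y_i+\alpha-1}\, du_i \\
&= \frac{\lambda_i^{y_i}\alpha^\alpha}{\Gamma(y_i+1)\Gamma(\alpha)}\frac{1}{(\lambda_i+\alpha)^{\alpha+y_i}}\int_0^\infty \exp(-[\lambda_i+\alpha]u_i)\,[\lambda_i+\alpha]^{y_i+\alpha-1} u_i^{y_i+\alpha-1}\,[\lambda_i+\alpha]\, du_i \\
&= \frac{\lambda_i^{y_i}\alpha^\alpha}{\Gamma(y_i+1)\Gamma(\alpha)}\frac{1}{(\lambda_i+\alpha)^{\alpha+y_i}}\int_0^\infty \exp(-t)\, t^{(y_i+\alpha)-1}\, dt \\
&= \frac{\lambda_i^{y_i}\alpha^\alpha\,\Gamma(y_i+\alpha)}{\Gamma(y_i+1)\Gamma(\alpha)(\lambda_i+\alpha)^{\alpha+y_i}} \\
&= \frac{\Gamma(y_i+\alpha)}{\Gamma(y_i+1)\Gamma(\alpha)}\left(\frac{\lambda_i}{\lambda_i+\alpha}\right)^{y_i}\left(\frac{\alpha}{\lambda_i+\alpha}\right)^{\alpha}.
\end{aligned}$$

This is one possible formulation of the negative binomial distribution, which has $E[y_i \mid x_i] = \lambda_i$ and $\operatorname{Var}[y_i \mid x_i] = \dfrac{\lambda_i}{\alpha}(\alpha + \lambda_i)$.

4.2 Policy parameters

The quantity we are usually interested in is the effect of $x$ on $y$. In the linear model

$$y = x'\beta + \eta + \varepsilon,$$

this quantity is equal to $\beta$, identically for all individuals. In binary choice models $\beta$ conveys information on the relative impact of $x$ on $\Pr[y = 1 \mid x, \eta]$. The magnitude of the effect of a change in one regressor $x_1$ on $\Pr[y = 1 \mid x_1, x_2, \eta]$ now depends on the whole vector $(x_1, x_2, \eta)$. For a continuous $x_1$ this impact is given by

$$\frac{\partial}{\partial x_1}\Pr[y = 1 \mid x, \eta] = \frac{\partial}{\partial x_1}F(x'\beta + \eta) = \beta_1 f(x'\beta + \eta),$$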

and for a discrete variable by

$$\Pr[y = 1 \mid x_1 = b, x_2, \eta] - \Pr[y = 1 \mid x_1 = a, x_2, \eta] = F(b\beta_1 + x_2'\beta_2 + \eta) - F(a\beta_1 + x_2'\beta_2 + \eta).$$

The parameter we are interested in is the population average of the above quantities, i.e.

$$\int\left\{F(b\beta_1 + x_2'\beta_2 + \eta) - F(a\beta_1 + x_2'\beta_2 + \eta)\right\} dG(\eta, x_2 \mid x_1 = a). \tag{3}$$

Chamberlain (1984) proposed the mean effect for a randomly drawn individual,

$$\int\left\{F(b\beta_1 + x_2'\beta_2 + \eta) - F(a\beta_1 + x_2'\beta_2 + \eta)\right\} dG(\eta, x_2). \tag{4}$$

Expressions (3) and (4) measure different quantities. Expression (3) is appropriate if we want to measure, say, the effect on female labor participation of having a third child ($x_1 = 3$), for those women who already have two children ($x_1 = 2$). In this case the unobserved individual effects $\eta$ will be distributed differently than in the total population. Expression (4) would be the average effect of going from two to three children. In case $x_1$ is a treatment, (3) is the average treatment effect on the untreated, and (4) is the average treatment effect. The overall effect of having one extra child is given by

$$\int\left\{F((x_1 + 1)\beta_1 + x_2'\beta_2 + \eta) - F(x_1\beta_1 + x_2'\beta_2 + \eta)\right\} dG(\eta, x_1, x_2).$$

In the case of a dynamic model

$$y_{it} = 1\left(y_{i,t-1}\alpha + x_{it}'\beta + \eta_i + v_{it} \ge 0\right),$$

the long-run effect of a change in $x$ on $\Pr[y = 1 \mid x, \eta]$ is given by

$$\frac{\partial}{\partial x}\left\{\frac{F(x_{it}'\beta + \eta_i)}{1 - F(\alpha + x_{it}'\beta + \eta_i) + F(x_{it}'\beta + \eta_i)}\right\}.$$

(Footnote 7: We have that

$$\begin{aligned}
\Pr[y_{it} = 1 \mid x, \eta] &= \Pr[y_{i,t-1}\alpha + x_{it}'\beta + \eta_i + v_{it} \ge 0 \mid x, \eta] \\
&= \Pr[\alpha + x_{it}'\beta + \eta_i + v_{it} \ge 0 \mid x, \eta]\,\Pr[y_{i,t-1} = 1 \mid x, \eta] \\
&\quad + \Pr[x_{it}'\beta + \eta_i + v_{it} \ge 0 \mid x, \eta]\left(1 - \Pr[y_{i,t-1} = 1 \mid x, \eta]\right).
\end{aligned}$$

In the steady state, it holds that $\Pr[y_{it} = 1 \mid x, \eta] = \Pr[y_{i,t-1} = 1 \mid x, \eta]$, and thus that

$$\Pr[y_{it} = 1 \mid x, \eta] = \frac{\Pr[x_{it}'\beta + \eta_i + v_{it} \ge 0 \mid x, \eta]}{1 - \Pr[\alpha + x_{it}'\beta + \eta_i + v_{it} \ge 0 \mid x, \eta] + \Pr[x_{it}'\beta + \eta_i + v_{it} \ge 0 \mid x, \eta]} = \frac{F(x_{it}'\beta + \eta_i)}{1 - F(\alpha + x_{it}'\beta + \eta_i) + F(x_{it}'\beta + \eta_i)}.)$$

4.3 The incidental parameter problem

Consider a panel model with $N$ agents observed for $T$ periods. The model contains agent-specific individual effects $\eta_i$, together with the $K \times 1$ vector $\theta$ of parameters of interest. Application of ML to estimate all parameters usually yields inconsistent (as $N \to \infty$) estimates for the common parameters $\theta$. Since the complete parameter vector $(\eta_1, \eta_2, \ldots, \eta_N, \theta)$ has $N + K$ elements and thus grows without bound as $N \to \infty$, standard consistency theorems fail.

Example 1. Consider the model with individual-specific means

$$y_{it} \sim N(\eta_i, \sigma^2),$$

for which the $i$-th term of the log-likelihood can be written as

$$l_i = -\frac{T}{2}\ln\sigma^2 - \frac{1}{2\sigma^2}\sum_{t=1}^{T}(y_{it} - \eta_i)^2.$$

Maximizing the likelihood results in

$$\hat\eta_i = \bar y_i, \qquad \hat\sigma^2 = \frac{1}{NT}\sum_{i,t}(y_{it} - \bar y_i)^2,$$

from which it is clear that

$$\operatorname*{plim}_{N\to\infty}\hat\sigma^2 = \sigma^2 - \frac{\sigma^2}{T}.$$

Example 2. Chamberlain (1992) studied identification of the following fixed effects binary choice model with $T = 2$,

$$y_{it} = 1\left(\alpha'x_{it} + \eta_i + v_{it} \ge 0\right),$$

where the $v_{it}$ are IID over time, independent of $x_{it}$ and $\eta_i$, with known CDF $F$ for which the usual restrictions apply. Furthermore, the vectors of explanatory variables and of parameters are partitioned as $x_{it} = (d_t, z_{it}')'$ and $\alpha = (\beta, \gamma')'$, where $d_t$ is a time dummy such that $d_1 = 0$ and $d_2 = 1$, and $z_{it}$ is a continuous RV with bounded support. Chamberlain showed that under these conditions there exists a value for $\beta$ such that identification fails for all $\alpha$ in a neighborhood of $(\beta, 0)$.

4.4 Terminology

In the present context, fixed effects refers to a model for the effect of $x_{it}$ on $y_{it}$, given $x_{it}$ (observed) and $\eta_i$ (unobserved), whereby the distribution of $\eta_i \mid x_{it}$ is left unspecified, while random effects denotes a model in which some knowledge about the form of $\eta_i \mid x_{it}$ is assumed (Arellano, 2003).

Part II Linear Models

5 Static

$$y_{it} = x_{it}'\beta + z_i'\gamma + \eta_i + v_{it},$$

with $x_{it}$ and $z_i$ a $K_x \times 1$, respectively $K_z \times 1$, vector of explanatory variables, for $i = 1, \ldots, N$ and $t = 1, \ldots, T$. Stacked, the above equation looks like

$$Y = X\beta + Z\gamma + H + V,$$

where

$$Y = \begin{pmatrix} Y_1 \\ Y_2 \\ \vdots \\ Y_N \end{pmatrix}; \qquad Y_j = \begin{pmatrix} y_{j1} \\ y_{j2} \\ \vdots \\ y_{jT} \end{pmatrix}.$$

5.1 Uncorrelated individual effects

$$E[x_{it}\eta_i] = 0 \tag{5}$$

5.1.1 OLS

Assumption A: $x_{it}$ is predetermined, or contemporaneously uncorrelated with the idiosyncratic error term:

$$E[x_{it}v_{it}] = 0. \tag{6}$$

Expressions (5) and (6) imply

$$E[x_{it}u_{it}] = 0,$$

where $u_{it} = \eta_i + v_{it}$. Thus

$$\hat\beta_{OLS} \xrightarrow{as} \beta, \quad \text{for } N \to \infty,\ T \to \infty, \text{ or } N, T \to \infty.$$

In general, under both (5) and (6), pooled estimators are consistent. Remark that this rules out serial correlation in the $v_{it}$.

5.1.2 GLS

$$E[UU'] = \Omega = \begin{pmatrix} \Omega_i & & \\ & \ddots & \\ & & \Omega_i \end{pmatrix},$$

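with $\Omega_i = \sigma_v^2 I_T + \sigma_\eta^2 J_T$, where $J_T$ is a $T \times T$ matrix filled with 1s. Under (5) and (6) it holds that

$$\hat\beta_{GLS} \xrightarrow{as} \beta, \quad \text{for } N \to \infty,\ T \to \infty, \text{ or } N, T \to \infty,$$

where

$$\hat\beta_{GLS} = (X'\Omega^{-1}X)^{-1}X'\Omega^{-1}Y.$$

This estimator can be obtained by $\theta$-differencing the data, i.e.

$$y_{it}^d = y_{it} - (1 - \theta)\bar y_i,$$

where

$$\bar y_i = T^{-1}\sum_{s=1}^{T} y_{is} \qquad \text{and} \qquad \theta^2 = \frac{\sigma_v^2}{\sigma_v^2 + T\sigma_\eta^2},$$

and subsequently by estimating this transformed model by OLS.

5.1.3 FGLS

Feasible GLS uses consistent estimates of $\sigma_v^2$ and $\sigma_\eta^2$ to estimate $\theta$ consistently. It holds that

$$\hat\beta_{FGLS} \xrightarrow{as} \hat\beta_{GLS}, \quad \text{for } N \to \infty,\ T \to \infty, \text{ or } N, T \to \infty.$$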
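The equivalence between GLS and OLS on $\theta$-differenced data can be checked numerically. The sketch below assumes a balanced panel, treats $\sigma_v^2$ and $\sigma_\eta^2$ as known (so it is GLS rather than FGLS), and takes $\theta = \sqrt{\sigma_v^2/(\sigma_v^2 + T\sigma_\eta^2)}$; the simulated design and all variable names are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(3)
N, T = 200, 4
sigma_v, sigma_eta = 1.0, 2.0
beta_true = np.array([1.0, 0.5])          # intercept and slope

# Balanced panel with individual effects uncorrelated with x
x = rng.normal(size=(N, T))
eta = rng.normal(scale=sigma_eta, size=(N, 1))
v = rng.normal(scale=sigma_v, size=(N, T))
y = beta_true[0] + beta_true[1] * x + eta + v

theta = np.sqrt(sigma_v**2 / (sigma_v**2 + T * sigma_eta**2))

def quasi_demean(a):
    """theta-differencing: a_it - (1 - theta) * mean_t(a_it)."""
    return a - (1.0 - theta) * a.mean(axis=1, keepdims=True)

# OLS on the theta-differenced data (the constant is transformed as well)
ones = np.ones((N, T))
Xd = np.column_stack([quasi_demean(ones).ravel(), quasi_demean(x).ravel()])
yd = quasi_demean(y).ravel()
beta_theta = np.linalg.lstsq(Xd, yd, rcond=None)[0]

# Direct GLS with the block-diagonal Omega (individual-major stacking)
Omega_i = sigma_v**2 * np.eye(T) + sigma_eta**2 * np.ones((T, T))
Omega_inv = np.kron(np.eye(N), np.linalg.inv(Omega_i))
X = np.column_stack([ones.ravel(), x.ravel()])
beta_gls = np.linalg.solve(X.T @ Omega_inv @ X, X.T @ Omega_inv @ y.ravel())

print("theta-differenced OLS:", beta_theta)
print("direct GLS:          ", beta_gls)
```

The two estimates agree up to floating-point error because $I_T - (1-\theta)J_T/T$ is proportional to $\Omega_i^{-1/2}$. A feasible version would replace the two variances by consistent estimates.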

5.1.4 Between Groups Estimator

Averaging all cross-sections over time results in

$$\bar y_i = \bar x_i'\beta + z_i'\gamma + \eta_i + \bar v_i,$$

for $i = 1, \ldots, N$. OLS on this (constructed) cross-section is called the between groups estimator. Under both (5) and (6), it is consistent, but inefficient.

5.2 Correlated individual effects

$$E[x_{it}\eta_i] \ne 0 \tag{7}$$

Under (7), OLS suffers from omitted variable bias, i.e.

$$E[\hat\beta_{OLS}] = \beta + E\left[(X'X)^{-1}X'H\right].$$

5.2.1 Within estimator or Least Squares Dummy Variables (LSDV) estimation

Assumption B: $x_{it}$ is strictly exogenous:

$$E[x_{it}v_{is}] = 0, \quad \forall s, t. \tag{8}$$

Assumption B is crucial for fixed $T$ asymptotics. It does not allow for lagged dependent variables. The within transformation

$$\tilde y_{it} = y_{it} - \bar y_i$$

has the effect that time-invariant variables, and thus also the individual effects, disappear:

$$\tilde z_i = z_i - \bar z_i = z_i - z_i = 0.$$

Endogeneity that is only caused by (7) is removed by within-transforming the data, and thus OLS on within-transformed data is consistent for $N \to \infty$. This is true since

$$E[\tilde x_{it}\tilde v_{it}] = 0$$

is implied by (8).

5.2.2 First differenced OLS

First differencing:

$$\Delta y_{it} = y_{it} - y_{i,t-1}.$$

OLS on first differenced data is consistent for $N \to \infty$ if

$$E[\Delta x_{it}\Delta v_{it}] = 0,$$

which is implied by (8), but not vice versa.

5.2.3 Hausman-Taylor IV

Hausman and Taylor (1981) consider the model

$$y_{it} = x_{it}'\beta + z_i'\gamma + \mu_i + \varepsilon_{it} \qquad (i = 1, \ldots, N;\ t = 1, \ldots, T),$$

where $\beta$ and $\gamma$ are $k \times 1$ and $g \times 1$ vectors of coefficients associated with the time-varying and time-invariant observable variables $x_{it}$ and $z_i$ respectively. The disturbance $\varepsilon_{it}$ is assumed uncorrelated with the vector $(x_{it}, z_i, \mu_i)$ and has zero mean and constant variance $\sigma_\varepsilon^2$ conditional on $x_{it}$ and $z_i$.

The latent individual effect $\mu_i$ is assumed to be a time-invariant random variable, distributed independently across individuals, with variance $\sigma_\mu^2$.

The standard strategy of transforming the data, either by taking deviations from individual means or first differences, has two drawbacks:

1. time-invariant observable variables $z_i$ are eliminated;
2. under certain circumstances, the within-groups estimator is not fully efficient.

Prior information lets us distinguish those elements of $x_{it}$ and $z_i$ which are asymptotically uncorrelated with $\mu_i$ from those which are not. Partitioning $x_{it} = (x_{1it}', x_{2it}')'$ and $z_i = (z_{1i}', z_{2i}')'$ accordingly, for fixed $T$ let

$$\operatorname*{plim}_{N\to\infty} N^{-1}\sum_{i=1}^{N} x_{1it}\mu_i = 0_{k_1}, \qquad \operatorname*{plim}_{N\to\infty} N^{-1}\sum_{i=1}^{N} x_{2it}\mu_i = c_x,$$

$$\operatorname*{plim}_{N\to\infty} N^{-1}\sum_{i=1}^{N} z_{1i}\mu_i = 0_{g_1}, \qquad \operatorname*{plim}_{N\to\infty} N^{-1}\sum_{i=1}^{N} z_{2i}\mu_i = c_z,$$

with the $k_2 \times 1$, respectively $g_2 \times 1$, vectors $c_x$ and $c_z$ assumed unequal to zero. Define

$$P_V = I_N \otimes T^{-1}\iota_T\iota_T', \qquad Q_V = I_{NT} - P_V;$$

then pre-multiplication with $P_V$ and $Q_V$ transforms a variable into group means, respectively deviations therefrom. Hausman and Taylor (1981) now consider the set of instruments

$$A = (Q_V, X_1, Z_1).$$

It can be shown that $P_A = P_B$, where $B = (P_V X_1, Q_V X_1, Z_1)$. A necessary condition for the identification of $(\beta', \gamma')'$ is that $k_1 \ge g_2$. A necessary and sufficient condition for the identification of $(\beta', \gamma')'$ is that

$$\det\left[(X, Z)'P_A(X, Z)\right] \ne 0.$$

In addition, they propose to estimate

$$P_A\Omega^{-1/2}Y = P_A\Omega^{-1/2}X\beta + P_A\Omega^{-1/2}Z\gamma + P_A\Omega^{-1/2}(\mu_i + \varepsilon_{it}) \qquad (i = 1, \ldots, N;\ t = 1, \ldots, T),$$

where

$$\Omega = \operatorname{Var}[\mu_i + \varepsilon_{it}] = \sigma_\varepsilon^2 I_{NT} + T\sigma_\mu^2 P_V,$$

by OLS.

5.2.4 Chamberlain device

Model

$$y_{it} = x_{it}'\beta + z_i'\gamma + \eta_i + v_{it},$$

under FE condition (7),

$$E[x_{it}\eta_i] \ne 0.$$

Parsimonious version

Assume that

$$E[\eta_i \mid x_{i1}, \ldots, x_{iT}] = \bar x_i'\theta,$$

and estimate

$$y_{it} = x_{it}'\beta + z_i'\gamma + \bar x_i'\theta + \zeta_i + v_{it},$$

where, by construction, $E[x_{it}\zeta_i] = 0$ (under the assumptions made). On top of the $K_x + K_z$ explanatory variables, $K_x$ degrees of freedom are needed for this type of correction. Remark that the LSDV estimator needs $N - 1$ additional degrees of freedom.
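A small simulation can illustrate the parsimonious version. In a balanced panel, pooled OLS of $y_{it}$ on $x_{it}$ and the individual mean $\bar x_i$ gives a coefficient on $x_{it}$ that coincides numerically with the within estimator, while pooled OLS without $\bar x_i$ is biased when $\eta_i$ is correlated with $x_{it}$. The design below (a single regressor, no $z_i$, and the particular correlation structure) is a hypothetical illustration.

```python
import numpy as np

rng = np.random.default_rng(4)
N, T = 500, 5
beta = 1.5

# Individual effects correlated with the regressor: E[x_it * eta_i] != 0
eta = rng.normal(size=(N, 1))
x = 0.8 * eta + rng.normal(size=(N, T))
y = beta * x + eta + rng.normal(size=(N, T))

ones = np.ones(N * T)
xbar = np.repeat(x.mean(axis=1, keepdims=True), T, axis=1)

# Pooled OLS ignoring the correlation (omitted variable bias)
X_pool = np.column_stack([ones, x.ravel()])
beta_pooled = np.linalg.lstsq(X_pool, y.ravel(), rcond=None)[0][1]

# Parsimonious device: add the individual mean of x as a regressor
X_cre = np.column_stack([ones, x.ravel(), xbar.ravel()])
beta_cre = np.linalg.lstsq(X_cre, y.ravel(), rcond=None)[0][1]

# Within estimator: OLS on deviations from individual means
xw = x - x.mean(axis=1, keepdims=True)
yw = y - y.mean(axis=1, keepdims=True)
beta_within = (xw * yw).sum() / (xw ** 2).sum()

print("pooled OLS:       ", beta_pooled)
print("with x-bar added: ", beta_cre)
print("within estimator: ", beta_within)
```

The correction costs one extra coefficient per time-varying regressor instead of one dummy per individual, which matches the degrees-of-freedom comparison with LSDV made above.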

Full version

Without any further assumption, we have that

$$E[\eta_i \mid x_{i1},\dots,x_{iT}] = \sum_{s=1}^{T} x_{is}'\theta_s,$$

which leads to the model

$$y_{it} = x_{it}'\beta + z_i'\gamma + \sum_{s=1}^{T} x_{is}'\theta_s + \psi_i + v_{it},$$

where, by construction, $E[x_{it}\psi_i] = 0$. This type of correction uses $T K_x$ degrees of freedom, which for $T$ small compared to $N$ might still be an improvement over LSDV.

5.3 Hausman test

Testing for correlation between the individual effects and the variables is especially useful for fixed $T$. It rests on the observation that $\hat\beta_W$ is consistent irrespective of this correlation, while $\hat\beta_B$, $\hat\beta_{GLS}$ and $\hat\beta_{OLS}$ are inconsistent in the presence of such correlation. Define the distance between $\hat\beta_W$ and $\hat\beta_{GLS}$ as

$$q = \hat\beta_W - \hat\beta_{GLS}.$$

Under $H_0: E[x_{it}\eta_i] = 0$, it holds that

$$q \xrightarrow{D} N(0; \Sigma_q), \qquad \text{with } \Sigma_q = \Sigma_{\hat\beta_W} - \Sigma_{\hat\beta_{GLS}},$$

and

$$q'\Sigma_q^{-1}q \xrightarrow{D} \chi^2_k.$$

6 Dynamic

6.1 No exogenous regressors

Consider

$$y_{it} = \alpha y_{i,t-1} + x_{it}'\beta + z_i'\gamma + \eta_i + v_{it}, \qquad |\alpha| < 1, \qquad (9)$$

for $i = 1,\dots,N$ and $t = 2,\dots,T$, and assume for now $\beta, \gamma = 0$, i.e.

$$y_{it} = \alpha y_{i,t-1} + \eta_i + v_{it}, \qquad |\alpha| < 1,$$

which implies

$$y_{it} = (1-\alpha)^{-1}\eta_i + \sum_{s=0}^{\infty}\alpha^s v_{i,t-s}. \qquad (10)$$

Consequences of the lagged dependence:

$E[y_{i,t-1}\eta_i] = \sigma^2_\eta/(1-\alpha) > 0$ (correlation with the individual effects)

$E[y_{i,t-1}v_{i,t-1}] = \sigma^2_v > 0$ (no strict exogeneity)

6.1.1 Previous Estimators

As before, the OLS estimator for $\alpha$ suffers from omitted variable bias, which does not vanish with increasing sample size. The inconsistency is given by

$$\operatorname{plim}_{N\to\infty}\hat\alpha_{OLS} = \alpha + E\left[y^2_{i,t-1}\right]^{-1}E[y_{i,t-1}\eta_i] \ne \alpha,$$

since $E[y_{i,t-1}\eta_i] \ne 0$.
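The size of the OLS inconsistency can be checked numerically. The sketch below assumes, beyond what the text states, that $\eta_i$ and $v_{it}$ are independent with variances $\sigma^2_\eta$ and $\sigma^2_v$ and that the process is stationary, so that (10) gives $E[y_{i,t-1}\eta_i] = \sigma^2_\eta/(1-\alpha)$ and $E[y^2_{i,t-1}] = \sigma^2_\eta/(1-\alpha)^2 + \sigma^2_v/(1-\alpha^2)$. Function and variable names are hypothetical.

```python
import numpy as np

def ols_plim(alpha, s2_eta=1.0, s2_v=1.0):
    """plim of pooled OLS (no intercept) in y_it = alpha*y_i,t-1 + eta_i + v_it."""
    cov_y_eta = s2_eta / (1 - alpha)
    var_y = s2_eta / (1 - alpha) ** 2 + s2_v / (1 - alpha ** 2)
    return alpha + cov_y_eta / var_y

rng = np.random.default_rng(1)
N, T, alpha = 20000, 6, 0.5       # assumed illustration values

# Draw y_i1 from the stationary distribution implied by (10).
eta = rng.normal(size=N)
y = np.empty((N, T))
y[:, 0] = eta / (1 - alpha) + rng.normal(size=N) / np.sqrt(1 - alpha ** 2)
for t in range(1, T):
    y[:, t] = alpha * y[:, t - 1] + eta + rng.normal(size=N)

# Pooled OLS of y_it on y_i,t-1.
y_lag, y_cur = y[:, :-1].ravel(), y[:, 1:].ravel()
alpha_ols = (y_lag @ y_cur) / (y_lag @ y_lag)

print(alpha_ols, ols_plim(alpha))
```

With these values the OLS estimate stays well above the true $\alpha$ however large $N$ is, in line with the upward omitted variable bias described above.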

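In this dynamic setting, however, the within estimator also suffers from bias, termed the Nickell (1981) bias, since

$$\operatorname{plim}_{N\to\infty}(\hat\alpha - \alpha) = \frac{\sum_{t=2}^{T} A_t}{\sum_{t=2}^{T} B_t},$$

with $A_t = E[\tilde y_{i,t-1}\tilde v_{it}]$ and $B_t = E\left[\tilde y^2_{i,t-1}\right]$. Of these quantities, $A_t$ can be simplified to

$$A_t = -(T-1)^{-1}E\left[y_{i,t-1}\sum_{s=2}^{T} v_{is}\right] - (T-1)^{-1}E\left[\sum_{s=1}^{T-1} y_{is}v_{it}\right] + (T-1)^{-2}E\left[\sum_{s=1}^{T-1} y_{is}\sum_{r=2}^{T} v_{ir}\right],$$

which, combined with (10), gives

$$A_t = -(T-1)^{-1}E\left[\sum_{r=0}^{\infty}\alpha^r v_{i,t-1-r}\sum_{s=2}^{T} v_{is}\right] - (T-1)^{-1}E\left[\sum_{s=1}^{T-1}\sum_{r=0}^{\infty}\alpha^r v_{i,s-r}v_{it}\right] + (T-1)^{-2}E\left[\sum_{s=1}^{T-1}\sum_{r=0}^{\infty}\sum_{q=2}^{T}\alpha^r v_{i,s-r}v_{iq}\right].$$

After some manipulation we finally get

$$A_t = -\frac{\sigma^2_v}{(T-1)(1-\alpha)}\left\{1 - \alpha^{t-2} - \alpha^{T-t} + \frac{1-\alpha^{T-1}}{(T-1)(1-\alpha)}\right\}.$$

Similarly it holds that

$$B_t = \frac{\sigma^2_v}{1-\alpha^2}\left(1 - \frac{1}{T-1}\right) + \frac{2\alpha}{1-\alpha^2}A_t.$$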

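Combining the expressions for both $A_t$ and $B_t$ results in

$$\operatorname{plim}_{N\to\infty}(\hat\alpha - \alpha) = -\frac{1+\alpha}{T-2}\left\{1 - \frac{1-\alpha^{T-1}}{(T-1)(1-\alpha)}\right\}\left\{1 - \frac{2\alpha}{(T-2)(1-\alpha)}\left[1 - \frac{1-\alpha^{T-1}}{(T-1)(1-\alpha)}\right]\right\}^{-1},$$

from which it can be clearly seen that

$$\operatorname{plim}_{N\to\infty}\hat\alpha_W = \alpha + O\left(T^{-1}\right).$$

The within estimator is inconsistent for $N \to \infty$ with $T$ fixed, but consistent for $T \to \infty$ and for $N, T \to \infty$.

Remark 1 Do not test $H_0: \alpha = 0$ using $\hat\alpha_W$, unless $T$ is large, since

$$\lim_{\alpha\to 0}\operatorname{plim}_{N\to\infty}[\hat\alpha_W - \alpha] \ne 0.$$

The first-differenced OLS estimator is inconsistent under any type of asymptotics, since

$$E[\Delta y_{i,t-1}\Delta v_{it}] = -E[y_{i,t-1}v_{i,t-1}] = -\sigma^2_v \ne 0.$$

Remark 2 Use these estimators as benchmarks. For any consistent estimator $\hat\alpha$ it holds, in large samples, that

$$\hat\alpha_W,\ \hat\alpha_{DOLS} \le \hat\alpha \le \hat\alpha_{OLS}.$$

Remark 3 Letting $A = \sum_t \tilde y_{i,t-1}\tilde v_{it}$ and $B = \sum_t \tilde y^2_{i,t-1}$, the Nickell bias can be written as $E[A]/E[B]$. It is equal to the approximation to the order $T^{-1}$ of the standard Hurwicz (1950) bias $E[A/B]$:

$$E\left[\frac{A}{B}\right] = \frac{E[A]}{E[B]}\left\{1 - \frac{\operatorname{Cov}[A, B]}{E[A]E[B]} + \frac{\operatorname{Var}[B]}{(E[B])^2}\right\} + o\left(T^{-1}\right).$$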
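As a rough check on the size of the Nickell bias, the sketch below codes the fixed-$T$ probability limit for a panel in which $y_{i1},\dots,y_{iT}$ are observed, so that $T-1$ equations are available, and compares it with a simulated within estimate. The closed form assumes the stationary representation (10). The function names and the simulation design are hypothetical choices for this illustration.

```python
import numpy as np

def nickell_bias(alpha, T):
    """plim (N -> inf) of alpha_W - alpha with y_i1..y_iT observed (T >= 3)."""
    n = T - 1                                    # number of usable equations
    h = 1 - (1 - alpha ** n) / (n * (1 - alpha))
    return -(1 + alpha) * h / ((n - 1) * (1 - 2 * alpha * h / ((1 - alpha) * (n - 1))))

def within_estimator(y):
    """Within (LSDV) estimate of alpha from an N x T array of levels."""
    y_lag, y_cur = y[:, :-1], y[:, 1:]
    y_lag = y_lag - y_lag.mean(axis=1, keepdims=True)
    y_cur = y_cur - y_cur.mean(axis=1, keepdims=True)
    return (y_lag * y_cur).sum() / (y_lag ** 2).sum()

rng = np.random.default_rng(2)
N, T, alpha = 20000, 6, 0.5       # assumed illustration values

# Stationary start, so that representation (10) holds in every period.
eta = rng.normal(size=N)
y = np.empty((N, T))
y[:, 0] = eta / (1 - alpha) + rng.normal(size=N) / np.sqrt(1 - alpha ** 2)
for t in range(1, T):
    y[:, t] = alpha * y[:, t - 1] + eta + rng.normal(size=N)

alpha_w = within_estimator(y)
print(alpha_w - alpha, nickell_bias(alpha, T))
```

At $\alpha = 0$ the function returns $-1/(T-1)$, which illustrates why Remark 1 warns against testing $H_0: \alpha = 0$ with $\hat\alpha_W$ when $T$ is small.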

6.1.2 IV

Consider the first-differenced equation

$$\Delta y_{it} = \alpha\Delta y_{i,t-1} + \Delta v_{it},$$

and assume that (6) applies to $y_{i,t-1}$. The lagged dependent variable is thus contemporaneously uncorrelated with the idiosyncratic error term:

$$E[y_{i,t-1}v_{it}] = 0.$$

This assumption follows when the $v_{it}$ are serially uncorrelated and are uncorrelated with $y_{i0}$. As seen above, OLS is inconsistent under any type of asymptotics, but we have that

$$E[y_{i,t-2}\Delta v_{it}] = 0, \qquad E[\Delta y_{i,t-2}\Delta v_{it}] = 0,$$

and thus both $y_{i,t-2}$ and $\Delta y_{i,t-2}$ are valid instruments for $\Delta y_{i,t-1}$. Anderson and Hsiao (1981) suggested the first type of instrument, resulting in

$$\hat\alpha_{AH} = \left(\Delta Y_{-1}'Y_{-2}\left(Y_{-2}'Y_{-2}\right)^{-1}Y_{-2}'\Delta Y_{-1}\right)^{-1}\Delta Y_{-1}'Y_{-2}\left(Y_{-2}'Y_{-2}\right)^{-1}Y_{-2}'\Delta Y. \qquad (11)$$

This estimator is consistent as $N \to \infty$ for fixed $T$.

6.1.3 GMM

Consider again the AR(1) panel data model (9) without exogenous regressors,

$$y_{it} = \alpha y_{i,t-1} + \eta_i + v_{it}, \qquad |\alpha| < 1,$$

and make the following assumptions:

A1 (Error components) $E[\eta_i] = E[v_{it}] = E[\eta_i v_{it}] = 0$

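A2 (Serially uncorrelated shocks) $E[v_{is}v_{it}] = 0$ for $s \ne t$

A3 (Predetermined initial conditions) $E[y_{i1}v_{it}] = 0$ for $t = 2,\dots,T$

Assumptions (A1)-(A3) specify a finite number of linear moment conditions. These can be exploited to construct a GMM estimator.

$$\begin{array}{ll}
\text{First-differenced equations} & \text{Valid instruments}\\
\hline
y_{i3} - y_{i2} = \alpha(y_{i2} - y_{i1}) + (v_{i3} - v_{i2}) & y_{i1}\\
y_{i4} - y_{i3} = \alpha(y_{i3} - y_{i2}) + (v_{i4} - v_{i3}) & y_{i1}, y_{i2}\\
\quad\vdots & \quad\vdots\\
y_{iT} - y_{i,T-1} = \alpha(y_{i,T-1} - y_{i,T-2}) + (v_{iT} - v_{i,T-1}) & y_{i1}, y_{i2}, \dots, y_{i,T-2}
\end{array}$$

Table 1: Valid instruments for the dynamic model

The moment conditions $E[y_{i1}\Delta v_{it}] = 0$ for $t = 3,\dots,T$ follow from A3, while $E[y_{i,t-s}\Delta v_{it}] = 0$ for $t = 3,\dots,T$ and $s \ge 2$ follow from A2 ($E[v_{is}v_{it}] = 0$), from A1 ($E[\eta_i v_{it}] = 0$) and from $E[y_{i,s-1}v_{it}] = 0$. These $M = (T-1)(T-2)/2$ moments can also be written as

$$E[Z_i'\Delta V_i] = 0,$$

and have as sample analogue

$$c_N(\alpha) = N^{-1}\sum_{i=1}^{N} Z_i'\Delta V_i(\alpha),$$

where $Z_i$ is the $(T-2) \times M$ matrix

$$Z_i = \begin{pmatrix}
y_{i1} & 0 & 0 & \cdots & 0 & \cdots & 0\\
0 & y_{i1} & y_{i2} & \cdots & 0 & \cdots & 0\\
\vdots & \vdots & \vdots & \ddots & \vdots & & \vdots\\
0 & 0 & 0 & \cdots & y_{i1} & \cdots & y_{i,T-2}
\end{pmatrix}$$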

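and $\Delta V_i$ is the $(T-2) \times 1$ vector

$$\Delta V_i = (\Delta v_{i3}, \Delta v_{i4}, \dots, \Delta v_{iT})'.$$

The GMM estimator minimizes the weighted quadratic distance between these moment conditions and zero:

$$\hat\alpha_{GMM} = \arg\min_\alpha J_N(\alpha) = \arg\min_\alpha \left(N^{-1}\sum_{i=1}^{N}\Delta V_i(\alpha)'Z_i\right)W_N\left(N^{-1}\sum_{i=1}^{N}Z_i'\Delta V_i(\alpha)\right) = \left(\Delta Y_{-1}'ZW_NZ'\Delta Y_{-1}\right)^{-1}\Delta Y_{-1}'ZW_NZ'\Delta Y, \qquad (12)$$

where $\Delta Y$ and $\Delta Y_{-1}$ are the stacked $N(T-2) \times 1$ vectors of observations on $\Delta y_{it}$ and $\Delta y_{i,t-1}$ respectively, $Z = (Z_1', \dots, Z_N')'$ is the stacked $N(T-2) \times M$ matrix of observations on the instruments, and $W_N$ is an $M \times M$ weight matrix. The simple expression on the last line arises from the linearity of the moment conditions.

For arbitrary $W_N$, the asymptotic covariance matrix of $\sqrt{N}\hat\alpha_{GMM}$ can be estimated by

$$\hat\Sigma_{\sqrt{N}\hat\alpha_{GMM}} = N\left(\Delta Y_{-1}'ZW_NZ'\Delta Y_{-1}\right)^{-1}\Delta Y_{-1}'ZW_N\hat\Sigma_VW_NZ'\Delta Y_{-1}\left(\Delta Y_{-1}'ZW_NZ'\Delta Y_{-1}\right)^{-1},$$

with $\hat\Sigma_V = N^{-1}\sum_{i=1}^{N}Z_i'\Delta\hat V_i\Delta\hat V_i'Z_i$ and $\Delta\hat V_i = \Delta Y_i - \hat\alpha_{GMM}\Delta Y_{i,-1}$.

The optimal GMM estimator chooses $W_N = \hat\Sigma_V^{-1}$. Its asymptotic variance can thus be estimated by

$$\hat\Sigma_{\sqrt{N}\hat\alpha_{GMM}} = N\left(\Delta Y_{-1}'Z\hat\Sigma_V^{-1}Z'\Delta Y_{-1}\right)^{-1}.$$

Remark 1 For $T = 3$, we have only one moment condition, $E[y_{i1}\Delta v_{i3}] = 0$. Consequently, our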

only parameter is exactly identified. Since weighting is irrelevant in this case, we obtain the optimal GMM estimator. It can easily be shown that $\hat\alpha_{GMM} = \hat\alpha_{AH}$ for $T = 3$.

Remark 2 For $T > 3$, $\hat\alpha_{GMM}$ is more efficient than $\hat\alpha_{AH}$: firstly because $\hat\alpha_{GMM}$ exploits more moment conditions, and secondly because $W_N = (Y_{-2}'Y_{-2})^{-1}$ is not optimal for first-differenced equations.

Optimal One Step

In the special case that $v_{it} \sim IID(0; \sigma^2_v)$, the optimal GMM estimator can be obtained in one step. This optimal choice of weight matrix does not coincide with 2SLS, since first-differencing introduces serial correlation in the error terms $\Delta v_{it}$. It can easily be shown that

$$E[\Delta V_i\Delta V_i'] = \sigma^2_vH, \qquad \text{with } H = \begin{pmatrix}
2 & -1 & 0 & \cdots & 0\\
-1 & 2 & -1 & \cdots & 0\\
0 & -1 & 2 & \ddots & \vdots\\
\vdots & & \ddots & \ddots & -1\\
0 & 0 & \cdots & -1 & 2
\end{pmatrix}.$$

Consequently, choosing $W_N = \left(N^{-1}\sum_{i=1}^{N}Z_i'HZ_i\right)^{-1}$ results in a one-step GMM estimator that is asymptotically equivalent to the optimal two-step estimator.

Extra Moment conditions

Without making extra assumptions beyond (A1)-(A3), an extra set of $T - 3$ moment conditions, quadratic in $\alpha$, can be exploited (Ahn and Schmidt, 1995):

$$E[u_{it}\Delta v_{i,t-1}] = 0 \quad \text{for } t = 4,\dots,T, \qquad (13)$$

where $u_{it} = \eta_i + v_{it}$. Moment conditions that are non-linear in the parameters require numerical optimization and are less common in practice.

Under the additional homoskedasticity assumption

$$E\left[v^2_{it}\right] = \sigma^2_i,$$

there are a further $T - 3$ linear moment conditions (Ahn and Schmidt, 1995):

$$E[y_{i,t-2}\Delta v_{i,t-1} - y_{i,t-1}\Delta v_{it}] = 0 \quad \text{for } t = 4,\dots,T. \qquad (14)$$

Further Issues

Over-fitting

Too many instruments (relative to the sample size $N$) is a source of finite sample bias. For small $N$ and large $T$, over-fitting is a distinct possibility, resulting in a (downward) finite sample bias towards the within estimator. Always check the sensitivity of empirical results with respect to the number of lags present in the set of instruments.

For $T \to \infty$, the within estimator is consistent, so first-differenced GMM is also consistent (but not very useful); see Alvarez and Arellano (2003) for a formal proof. In the presence of endogenous explanatory variables, however, the within estimator is no longer consistent as $T \to \infty$.

Weak Instruments and Unit Roots

IV and GMM have poor small sample properties when instruments are only weakly correlated with the endogenous regressors. For AR(1) models with $\alpha \to 1$, the correlation between $\Delta y_{i,t-1}$ and the lagged levels $y_{i,t-s}$ for $s \ge 2$ becomes weaker. Formally, $\alpha$ remains identified as $\alpha \to 1$ and the first-differenced GMM estimator remains consistent as $N \to \infty$, provided $\sigma^2_\eta > 0$. But Monte Carlo evidence suggests first-differenced GMM estimators to be very imprecise for $\alpha \ge 0.8$. For persistent time series, Blundell and Bond (1998) developed an extended GMM estimator (see below).

Consider now the alternative specification

$$y_{it} = \eta_i + \varepsilon_{it}, \qquad \varepsilon_{it} = \alpha\varepsilon_{i,t-1} + v_{it},$$

which can be reformulated as

$$y_{it} = \alpha y_{i,t-1} + (1-\alpha)\eta_i + v_{it}.$$

Under this specification, the process for $y_{it}$ approaches a pure random walk as $\alpha \to 1$ (instead of a random walk with drift, as above). Lagged levels are completely uninformative instruments for $\Delta y_{i,t-1}$ when $\alpha = 1$, and $\alpha$ is not identified using the moment conditions $E[y_{i,t-s}\Delta v_{it}] = 0$ for $t = 3,\dots,T$ and $s \ge 2$. Remark that for this model OLS in levels is consistent when $\alpha = 1$.

System GMM

It is possible that $\Delta y_{it}$ is uncorrelated with $\eta_i$, while $y_{it}$ is obviously correlated with $\eta_i$. This requires a further restriction on the process generating the initial conditions $y_{i1}$:

$$E[\Delta y_{i2}\eta_i] = 0. \qquad (15)$$

The validity of (15) has two implications.

1. The AR(1) specification implies

$$\Delta y_{it} = \alpha^{t-2}\Delta y_{i2} + \sum_{s=0}^{t-3}\alpha^s\Delta v_{i,t-s} \quad \text{for } t = 3,\dots,T,$$

which, together with (15), implies the $T - 2$ non-redundant linear moment conditions

$$E[\Delta y_{is}\eta_i] = 0 \quad \text{for } s = 2,\dots,T$$

for the equation in levels, which can, for example, be rewritten as

$$E[\Delta y_{is}u_{iT}] = 0 \quad \text{for } s = 2, 3,\dots,T-1.$$

2. Validity of these additional linear moment restrictions renders the quadratic moment restrictions (13) redundant. For example, it holds that

$$u_{it}\Delta u_{i,t-1} = u_{it}(\Delta y_{i,t-1} - \alpha\Delta y_{i,t-2}) = u_{it}\Delta y_{i,t-1} - \alpha u_{it}\Delta y_{i,t-2},$$

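and the assumptions $E[u_{it}\Delta y_{i,t-1}] = 0$ and $E[u_{it}\Delta y_{i,t-2}] = 0$ imply $E[u_{it}\Delta u_{i,t-1}] = 0$.

Conveniently, the complete set of moment conditions [8] implied by the standard assumptions ($(T-1)(T-2)/2$ of them) and the initial conditions restriction ($T-2$ of them) can be written as

$$E[y_{i,t-s}\Delta v_{it}] = 0 \quad \text{for } t = 3,\dots,T;\ s \ge 2,$$

$$E[\Delta y_{is}u_{iT}] = 0 \quad \text{for } s = 2, 3,\dots,T-1,$$

which can be restated as

$$E\left[Z_i^{S\prime}U_i^S\right] = 0,$$

and has as sample analogue

$$c_N(\alpha) = N^{-1}\sum_{i=1}^{N}Z_i^{S\prime}U_i^S(\alpha),$$

where $Z_i^S$ is the $(T-1) \times M_S$ matrix [9]

$$Z_i^S = \begin{pmatrix}
y_{i1} & 0 & 0 & \cdots & 0 & \cdots & 0 & 0 & \cdots & 0\\
0 & y_{i1} & y_{i2} & \cdots & 0 & \cdots & 0 & 0 & \cdots & 0\\
\vdots & \vdots & \vdots & \ddots & \vdots & & \vdots & \vdots & & \vdots\\
0 & 0 & 0 & \cdots & y_{i1} & \cdots & y_{i,T-2} & 0 & \cdots & 0\\
0 & 0 & 0 & \cdots & 0 & \cdots & 0 & \Delta y_{i2} & \cdots & \Delta y_{i,T-1}
\end{pmatrix} = \begin{pmatrix}
Z_i & 0\\
0 & (\Delta y_{i2}\ \cdots\ \Delta y_{i,T-1})
\end{pmatrix},$$

[8] Moreover, all these restrictions are linear.
[9] $M_S = (T+1)(T-2)/2$.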

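and $U_i^S$ is the $(T-1) \times 1$ vector

$$U_i^S = (\Delta v_{i3}, \Delta v_{i4}, \dots, \Delta v_{iT}, \eta_i + v_{iT})'.$$

Finally, define also the $(T-1) \times 1$ vector

$$Y_i^S = (\Delta y_{i3}, \Delta y_{i4}, \dots, \Delta y_{iT}, y_{iT})'.$$

The vector equation [10]

$$Y_i^S = \alpha Y_{i,-1}^S + U_i^S$$

represents $T - 2$ scalar equations in first differences, with one equation in levels added. The GMM estimator minimizes the weighted quadratic distance between the sample moment conditions and zero:

$$\hat\alpha_{SGMM} = \arg\min_\alpha J_N(\alpha) = \arg\min_\alpha \left(N^{-1}\sum_{i=1}^{N}U_i^S(\alpha)'Z_i^S\right)W_N^S\left(N^{-1}\sum_{i=1}^{N}Z_i^{S\prime}U_i^S(\alpha)\right) = \left(Y_{-1}^{S\prime}Z^SW_N^SZ^{S\prime}Y_{-1}^S\right)^{-1}Y_{-1}^{S\prime}Z^SW_N^SZ^{S\prime}Y^S, \qquad (16)$$

where $Y^S = (Y_1^{S\prime}, \dots, Y_N^{S\prime})'$, $Y_{-1}^S = (Y_{1,-1}^{S\prime}, \dots, Y_{N,-1}^{S\prime})'$, $Z^S = (Z_1^{S\prime}, \dots, Z_N^{S\prime})'$ and $W_N^S$ is an $M_S \times M_S$ weight matrix.

[10] As before, $Y_{i,-1}^S = L(Y_i^S)$, with $L(\cdot)$ the lag operator.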
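The first-differenced and system GMM estimators (12) and (16) share the same closed form and differ only in the stacked instruments and equations. The sketch below builds $Z_i$ and $Z_i^S$ for a simulated AR(1) panel and computes two-step estimates. It is an illustration under assumptions, not a reference implementation: the simulation design, the function names, and in particular the first-step weight $(N^{-1}\sum_i Z_i^{S\prime}Z_i^S)^{-1}$ used for the system estimator are choices made here, since no single one-step weight matrix is prescribed once a levels equation is added.

```python
import numpy as np

def simulate(N, T, alpha, rng):
    """AR(1) panel y_it = alpha*y_i,t-1 + eta_i + v_it with a stationary start."""
    eta = rng.normal(size=N)
    y = np.empty((N, T))
    y[:, 0] = eta / (1 - alpha) + rng.normal(size=N) / np.sqrt(1 - alpha ** 2)
    for t in range(1, T):
        y[:, t] = alpha * y[:, t - 1] + eta + rng.normal(size=N)
    return y

def diff_instruments(y_i):
    """Z_i, (T-2) x M: the row for equation t = 3..T holds y_i1, ..., y_i,t-2."""
    T = len(y_i)
    Z = np.zeros((T - 2, (T - 1) * (T - 2) // 2))
    col = 0
    for r in range(T - 2):                     # r = 0 is the equation for t = 3
        Z[r, col:col + r + 1] = y_i[:r + 1]
        col += r + 1
    return Z

def system_instruments(y_i):
    """Z_i^S: Z_i plus a period-T levels row instrumented by dy_i2..dy_i,T-1."""
    T = len(y_i)
    Z = diff_instruments(y_i)
    Zs = np.zeros((T - 1, Z.shape[1] + T - 2))
    Zs[:T - 2, :Z.shape[1]] = Z
    Zs[T - 2, Z.shape[1]:] = np.diff(y_i)[:T - 2]
    return Zs

def linear_gmm(Z, X, Y, W):
    """Closed form (X'Z W Z'X)^{-1} X'Z W Z'Y for a scalar parameter."""
    ZX = sum(z.T @ x for z, x in zip(Z, X))
    ZY = sum(z.T @ y for z, y in zip(Z, Y))
    return (ZX @ W @ ZY) / (ZX @ W @ ZX)

def two_step(Z, X, Y, W1):
    """First-step estimate with W1, then re-weight with the inverse moment covariance."""
    a1 = linear_gmm(Z, X, Y, W1)
    g = np.array([z.T @ (y - a1 * x) for z, x, y in zip(Z, X, Y)])
    return a1, linear_gmm(Z, X, Y, np.linalg.inv(g.T @ g / len(Z)))

rng = np.random.default_rng(0)
N, T, alpha = 5000, 5, 0.5                     # assumed illustration values
y = simulate(N, T, alpha, rng)
dy = np.diff(y, axis=1)                        # columns: dy_i2, ..., dy_iT

# First-differenced GMM, equation (12), with the one-step weight based on H.
Zd = [diff_instruments(y_i) for y_i in y]
Xd, Yd = list(dy[:, :-1]), list(dy[:, 1:])
H = 2 * np.eye(T - 2) - np.eye(T - 2, k=1) - np.eye(T - 2, k=-1)
W1_d = np.linalg.inv(sum(z.T @ H @ z for z in Zd) / N)
a_one_step, a_diff = two_step(Zd, Xd, Yd, W1_d)

# System GMM, equation (16); the first-step weight is an assumption of this sketch.
Zs = [system_instruments(y_i) for y_i in y]
Xs = list(np.column_stack([dy[:, :-1], y[:, -2]]))
Ys = list(np.column_stack([dy[:, 1:], y[:, -1]]))
W1_s = np.linalg.inv(sum(z.T @ z for z in Zs) / N)
_, a_sys = two_step(Zs, Xs, Ys, W1_s)

print(a_one_step, a_diff, a_sys)
```

Because the simulated initial observations are drawn from the stationary distribution, the extra levels moment conditions are valid in this design. Starting every individual at a fixed value instead would violate (15) and make the system estimator inconsistent.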

In some cases these additional moment conditions for the equation in levels can provide huge efficiency improvements and a reduction in finite sample bias. A persistent $y_{it}$ series is such an example. In this case, the quadratic moment conditions give a substantial improvement, but the additional linear restrictions stemming from the initial conditions restriction provide an even bigger gain in efficiency (Blundell and Bond, 1998).

In order to guarantee that $\Delta y_{i2}$ is uncorrelated with $\eta_i$, a restriction on the behavior of $y_{i1}$ is needed. This takes the form of a stationarity restriction on the $y_{it}$ series. Indeed, the representation

$$\Delta y_{it} = \alpha^{t-2}\Delta y_{i2} + \sum_{s=0}^{t-3}\alpha^s\Delta v_{i,t-s} \quad \text{for } t = 3,\dots,T$$

indicates that $\Delta y_{it}$ becomes uncorrelated with $\eta_i$ for $t \to \infty$. If the same process has been generating the $y_{it}$ series long enough [11] before the observation period, the observations on $\Delta y_{it}$ are uncorrelated with $\eta_i$.

More formally, define

$$e_{i1} = y_{i1} - \frac{\eta_i}{1-\alpha},$$

then we have that

$$\Delta y_{i2} = (\alpha - 1)y_{i1} + \eta_i + v_{i2} = (\alpha - 1)e_{i1} + v_{i2}.$$

Since the error components assumption (A1) states that $E[v_{i2}\eta_i] = 0$, a sufficient condition for $E[\Delta y_{i2}\eta_i]$ to be equal to zero is given by the restriction

$$E[e_{i1}\eta_i] = 0.$$

$\eta_i/(1-\alpha)$ is the steady state level of the $y_{it}$ series, i.e. the level the series will converge to for $T \to \infty$. Consequently, $e_{i1}$ represents the deviation from the steady state at the start of the sample period. The additional initial conditions assumption (15) is thus equivalent to the requirement that the initial deviations from the steady state are uncorrelated with the steady state level.

[11] Long enough for the influence of the true start-up conditions to have become negligible.
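The role of the initial conditions can be made concrete with a short simulation. The sketch below evaluates the sample covariance between $\Delta y_{i2} = (\alpha - 1)y_{i1} + \eta_i + v_{i2}$ and $\eta_i$ under two assumed start-up schemes: (a) $y_{i1}$ equal to the steady state $\eta_i/(1-\alpha)$ plus an independent deviation $e_{i1}$, and (b) every individual starting at zero, as when the observation period coincides with the true start of the process. All numerical values and names are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(3)
N, alpha = 200_000, 0.8           # assumed illustration values
eta = rng.normal(size=N)          # individual effects, variance 1
v2 = rng.normal(size=N)           # period-2 shocks

def cov_dy2_eta(y1):
    """Sample covariance between dy_i2 = (alpha-1)*y_i1 + eta_i + v_i2 and eta_i."""
    dy2 = (alpha - 1) * y1 + eta + v2
    return np.mean(dy2 * eta) - dy2.mean() * eta.mean()

# (a) Mean stationarity: initial deviation e_i1 is independent of eta_i.
e1 = rng.normal(size=N)
cov_stationary = cov_dy2_eta(eta / (1 - alpha) + e1)

# (b) Process starts at y_i1 = 0 for everyone, so e_i1 = -eta_i / (1 - alpha).
cov_startup = cov_dy2_eta(np.zeros(N))

print(cov_stationary, cov_startup)
```

Under (a) the covariance is close to zero, so (15) holds. Under (b) it is close to $\sigma^2_\eta$, so the extra moment conditions for the levels equation would be invalid.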

This imposes a restriction [12] on the mean of the $y_{it}$ series, but not on its variance. When the process we observe has started long ago, this assumption is quite reasonable. On the other hand, if the observation period coincides with the true start-up of a process (for instance the start of the Euro), this assumption may be unreasonable.

The system GMM estimator is similar to the case where levels or first differences of some $x_{it}$ variable can be used to obtain instruments for the equation in levels (see subsection 6.2.2 below).

Serially correlated errors

6.2 Exogenous regressors

Consider again model (9),

$$y_{it} = \alpha y_{i,t-1} + x_{it}'\beta + z_i'\gamma + \eta_i + v_{it}, \qquad |\alpha| < 1,$$
$$\phantom{y_{it}} = s_{it}'\delta + u_{it},$$

for $i = 1,\dots,N$ and $t = 2,\dots,T$, where $s_{it} = (y_{i,t-1}, x_{it}')'$ and $\delta = (\alpha, \beta')'$. We assume that $x_{it}$ is a $K \times 1$ vector of variables and we maintain assumptions (A1)-(A3). Consequently, the moment conditions $E[y_{i,t-s}\Delta v_{it}] = 0$, for $t = 3,\dots,T$ and $s \ge 2$, remain valid.

Different assumptions about $x_{it}$ will imply different extra sets of moment conditions [13]. The usual tradeoff between efficiency and robustness applies here. More restrictive assumptions imply the validity of additional moment restrictions, which will increase efficiency when they are valid, but imply inconsistency when violated.

6.2.1 Moment conditions for the equation in differences

Example 1 We assume that $x_{it}$ is predetermined with respect to the serially uncorrelated shocks $v_{it}$:

$$E[x_{is}v_{it}] = 0 \quad \text{for } s \le t,$$

[12] Sometimes called mean stationarity.
[13] $x_{it}$ can either be correlated or uncorrelated with $\eta_i$; $x_{it}$ may be endogenous, predetermined or strictly exogenous with respect to $v_{it}$.

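but correlated with the individual effects:

$$E[x_{it}\eta_i] \ne 0.$$

These assumptions imply that present shocks $v_{it}$ possibly affect future values of $x_{is}$ (where $s > t$). First-differencing eliminates $\eta_i$:

$$\Delta y_{it} = \alpha\Delta y_{i,t-1} + \Delta x_{it}'\beta + \Delta v_{it},$$

and the assumption of predeterminedness implies the moment conditions

$$E[x_{i,t-s}\Delta v_{it}] = 0 \quad \text{for } t = 3,\dots,T;\ s \ge 1,$$

i.e. lagged values of $x_{it}$ are uncorrelated with $\Delta v_{it} = v_{it} - v_{i,t-1}$. The complete set of linear moment conditions can again be written as

$$E\left[Z_i^{+\prime}\Delta V_i\right] = 0,$$

but with the matrix $Z_i$ now equal to

$$Z_i^+ = \begin{pmatrix}
y_{i1}\ x_{i1}'\ x_{i2}' & 0 & \cdots & 0\\
0 & y_{i1}\ y_{i2}\ x_{i1}'\ x_{i2}'\ x_{i3}' & \cdots & 0\\
\vdots & \vdots & \ddots & \vdots\\
0 & 0 & \cdots & y_{i1}\ \cdots\ y_{i,T-2}\ x_{i1}'\ \cdots\ x_{i,T-1}'
\end{pmatrix},$$

and GMM proceeds as before. Denoting by $\Delta X$ the stacked $N(T-2) \times (K+1)$ matrix of observations on $\Delta s_{it} = (\Delta y_{i,t-1}, \Delta x_{it}')'$, the GMM estimator of $\delta$ is given by

$$\hat\delta_{GMM} = \left(\Delta X'Z^+W_NZ^{+\prime}\Delta X\right)^{-1}\Delta X'Z^+W_NZ^{+\prime}\Delta Y.$$

Example 2 If we assume that $x_{it}$ is endogenous with respect to the serially uncorrelated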

shocks $v_{it}$:

$$E[x_{is}v_{it}] = 0 \quad \text{for } s < t,$$

then only the subset

$$E[x_{i,t-s}\Delta v_{it}] = 0 \quad \text{for } t = 3,\dots,T;\ s \ge 2$$

remains valid. In this case the treatment of $x_{is}$ and $y_{is}$ in the instrument matrix is symmetric.

Example 3 If $x_{it}$ is strictly exogenous with respect to the shocks $v_{it}$:

$$E[x_{is}v_{it}] = 0 \quad \forall s, t,$$

then the larger set of moment conditions

$$E[x_{is}\Delta v_{it}] = 0 \quad \text{for } t = 3,\dots,T;\ s = 1, 2,\dots,T$$

would be valid, which would imply the instrument matrix

$$Z_i = \begin{pmatrix}
y_{i1}\ x_{i1}'\ \cdots\ x_{iT}' & 0 & \cdots & 0\\
0 & y_{i1}\ y_{i2}\ x_{i1}'\ \cdots\ x_{iT}' & \cdots & 0\\
\vdots & \vdots & \ddots & \vdots\\
0 & 0 & \cdots & y_{i1}\ \cdots\ y_{i,T-2}\ x_{i1}'\ \cdots\ x_{iT}'
\end{pmatrix}.$$

Implementation of the different alternatives consists simply of adding or deleting columns from the instrument matrix $Z_i$.

6.2.2 Moment conditions for the equation in levels

Assume that $x_{it}$ is strictly exogenous with respect to the shocks $v_{it}$ and uncorrelated with the individual effects. We have $T$ observations on $x_{it}$ which are all uncorrelated with $\eta_i$, which

provides us with $T$ valid moment conditions for the equation in levels. One possibility [14] is to write these as

$$E[x_{is}u_{iT}] = 0 \quad \text{for } s = 1, 2,\dots,T,$$

with $u_{iT} = \eta_i + v_{iT}$. The full set of (linear) moment conditions can now be written as

$$E\left[Z_i^{*\prime}U_i^+\right] = 0,$$

where, with $Z_i$ the instrument matrix for the differenced equations under strict exogeneity,

$$Z_i^* = \begin{pmatrix}
Z_i & 0\\
0 & (x_{i1}'\ \cdots\ x_{iT}')
\end{pmatrix}, \qquad U_i^+ = (\Delta v_{i3}, \dots, \Delta v_{iT}, \eta_i + v_{iT})'.$$

We thus augment the system of first-differenced equations by adding the levels equation for the final period, and augment the instrument matrix by adding the $T$ valid instruments for this equation.

As before, the GMM estimators are based on the sample analogues of these moment conditions:

$$\hat\delta_{GMM} = \left(X'Z^*W_NZ^{*\prime}X\right)^{-1}X'Z^*W_NZ^{*\prime}Y.$$

[14] This possibility is elegant but useless for unbalanced panels.
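The three exogeneity assumptions of subsection 6.2.1 differ only in which dates of $x_{is}$ enter the instrument row of the differenced equation for period $t$. A minimal sketch of that bookkeeping follows. The function names and the string labels for the assumptions are hypothetical, and the column count covers only the $x$ instruments of the differenced equations for a $K$-dimensional $x_{it}$.

```python
def x_instrument_periods(t, T, assumption):
    """Periods s such that x_is is a valid instrument for the differenced equation at t."""
    if not 3 <= t <= T:
        raise ValueError("differenced equations exist for t = 3, ..., T")
    if assumption == "endogenous":            # E[x_is v_it] = 0 for s < t
        last = t - 2
    elif assumption == "predetermined":       # E[x_is v_it] = 0 for s <= t
        last = t - 1
    elif assumption == "strictly_exogenous":  # E[x_is v_it] = 0 for all s, t
        last = T
    else:
        raise ValueError(f"unknown assumption: {assumption}")
    return list(range(1, last + 1))

def n_x_columns(T, K, assumption):
    """Number of instrument columns contributed by x over equations t = 3..T."""
    return K * sum(len(x_instrument_periods(t, T, assumption)) for t in range(3, T + 1))

for a in ("endogenous", "predetermined", "strictly_exogenous"):
    print(a, x_instrument_periods(4, 6, a), n_x_columns(6, 2, a))
```

Moving from the endogenous to the predetermined case adds one date of $x$ per equation, which corresponds to adding columns to the instrument matrix as described in the text.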

When $x_{it}$ is only predetermined with respect to $v_{it}$, the instrument matrix changes to

$$Z_i^* = \begin{pmatrix}
Z_i^+ & 0\\
0 & (x_{i1}'\ \cdots\ x_{iT}')
\end{pmatrix}.$$

On the other hand, when $x_{it}$ is endogenous with respect to $v_{it}$ (but still uncorrelated with $\eta_i$), there are only $T - 1$ extra moment conditions,

$$E[x_{is}u_{iT}] = 0 \quad \text{for } s = 1, 2,\dots,T-1,$$

and the instrument matrix loses one column.

Remark There is no obvious choice for the one-step weight matrix when one or more levels equations are added to the system. The assumption that $v_{it} \sim IID(0; \sigma^2_v)$ does not imply a form for $E\left[U_i^+U_i^{+\prime}\right]$ that is proportional to a known matrix. As a consequence, the efficiency improvement between one-step and optimal GMM is expected to be more substantial, compared to the situation in which there are only moment restrictions in first differences.

Another possibility is that, although $x_{it}$ is correlated with $\eta_i$, some known function of $x_{it}$ is uncorrelated with $\eta_i$ and can thus be used to form valid instruments for the equations in levels. For instance, when the covariance between $x_{it}$ and $\eta_i$ is constant over time, the first differences of $x_{it}$ are uncorrelated with $\eta_i$. Arellano and Bover suggest using suitably dated first differences $\Delta x_{is}$ as instruments for the levels equation, where the suitability depends on the correlation between $\Delta x_{is}$ and $\eta_i$. This correlation in turn depends on whether $x_{it}$ is assumed to be endogenous, predetermined or strictly exogenous with respect to $v_{it}$.

Presence of variables that are uncorrelated with the individual effects thus allows the use of the levels equation. Use of the levels equation in turn opens up the possibility to identify coefficients on time-invariant explanatory variables $z_i$. In contrast, if all valid moment conditions require transformation of the model in order to eliminate the individual effects $\eta_i$, time-invariant variables $z_i$ are eliminated as well, and their coefficients $\gamma$ are not identified.
