Microeconometrics (PhD) Problem set 2: Dynamic Panel Data Solutions


Victor Knight

QUESTION 1

Data for this exercise can be prepared by running the do-file prepare.do posted on my webpage. This do-file collects the variables needed for the estimation from the 8 personal files of the European Community Household Panel (ECHP) and puts them together in a single file named earnings_it.dta. This data file already contains the variables you need for the estimation: female, age, experience, years of schooling (edu), log of earnings (lny), a personal id number (npid) and the number of periods the person is in the sample with non-missing information on all necessary variables (N).

a. The variance of log earnings decomposes as

$$\operatorname{Var}(\ln y_{it}) = \sigma^2_\eta + \sigma^2_\nu$$

where the first term represents the permanent component of the variance and the second the temporary component.

b. The output for this and the following sub-questions is produced using the do-file PS2_q1_0708.do. Here, I compute the components of the total variance as:

$$\sigma^2_\nu = \frac{1}{NT}\sum_{i=1}^{N}\sum_{t=1}^{T}\left(\ln y_{it} - \overline{\ln y}_i\right)^2$$

$$\sigma^2_\eta = \frac{1}{N}\sum_{i=1}^{N}\left(\overline{\ln y}_i - \overline{\ln y}\right)^2 - \frac{\sigma^2_\nu}{T}$$

********************************************************************************
************************************************
***       ESTIMATE VARIANCE COMPONENTS       ***
************************************************
use earnings_it.dta, clear
by npid: egen ybar_i=mean(lny)
by npid: egen var_yi=sd(lny)
keep if i==1
(14595 observations deleted)
replace var_yi=var_yi^2
(2083 real changes made)
egen var_v=mean(var_yi)
replace var_v=var_v
(0 real changes made)

egen var_ybar=sd(ybar)
replace var_ybar=var_ybar^2
(2085 real changes made)
g var_eta=var_ybar-(var_v/8)
g sd_v=sqrt(var_v)
g sd_eta=sqrt(var_eta)
g ratio_eta=var_eta/(var_v+var_eta)
g ratio_v=var_v/(var_v+var_eta)
su sd_eta sd_v ratio_eta ratio_v

    Variable |   Obs   Mean   Std. Dev.   Min   Max
      sd_eta |
        sd_v |
   ratio_eta |
     ratio_v |
********************************************************************************
And now I prepare the data for the random-effects estimates. In particular, you need to tsset your data, and to do that you need a numeric id number (our npid is a string). I do get the same results as in the manual computation above.
********************************************************************************
use earnings_it.dta, clear
destring npid, g(id)
npid has all characters numeric; id generated as long
tsset id wave
       panel variable:  id (strongly balanced)
        time variable:  wave, 1 to 8
xtreg lny, re

Random-effects GLS regression          Number of obs      =
Group variable (i): id                 Number of groups   = 2085
R-sq:  within  =                       Obs per group: min = 8
       between =                                      avg = 8.0
       overall =                                      max = 8
Random effects u_i ~ Gaussian          Wald chi2(0)       = 0.00
corr(u_i, X)   = 0 (assumed)           Prob > chi2        =

         lny |   Coef.   Std. Err.   z   P>|z|   [95% Conf. Interval]

       _cons |
     sigma_u |
     sigma_e |
         rho |   (fraction of variance due to u_i)
********************************************************************************
c. Now I add the covariates:
*********************************************************************
**** add covariates ****
xtreg lny female age agesq exp expsq edu, re

Random-effects GLS regression          Number of obs      =
Group variable (i): id                 Number of groups   = 2085
R-sq:  within  =                       Obs per group: min = 8
       between =                                      avg = 8.0
       overall =                                      max = 8
Random effects u_i ~ Gaussian          Wald chi2(6)       =
corr(u_i, X)   = 0 (assumed)           Prob > chi2        =

         lny |   Coef.   Std. Err.   z   P>|z|   [95% Conf. Interval]
      female |
         age |
       agesq |
         exp |
       expsq |
         edu |
       _cons |
     sigma_u |
     sigma_e |
         rho |   (fraction of variance due to u_i)
*******************************************************************************
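The variance decomposition of part (b) can be checked by hand on a tiny panel. The sketch below is only an illustration in Python, not part of the original do-file: the function name `variance_components` and the two-person toy panel are hypothetical. It mirrors the Stata steps, which use sample variances (the square of `sd()`, with an n-1 denominator) rather than the 1/NT and 1/N weights of the formulas, and it subtracts var_v/T where the do-file hard-codes T = 8.

```python
from statistics import mean, variance

def variance_components(panel):
    """panel: list of per-person lists of log earnings, all of the same length T."""
    T = len(panel[0])
    person_means = [mean(y_i) for y_i in panel]
    # Temporary component: average of the within-person sample variances
    # (the analogue of sd(lny)^2 by person, then egen mean()).
    var_v = mean(variance(y_i) for y_i in panel)
    # Permanent component: variance of the person means, net of the part of
    # the temporary noise that survives in each person's mean.
    var_eta = variance(person_means) - var_v / T
    return var_eta, var_v

# Hypothetical toy data: N = 2 persons observed for T = 2 waves.
toy_panel = [[1.0, 3.0], [5.0, 7.0]]
var_eta, var_v = variance_components(toy_panel)
ratio_eta = var_eta / (var_eta + var_v)  # share of the permanent component
print(var_eta, var_v, ratio_eta)
```

On real data the inputs would be the lny series grouped by npid; the share `ratio_eta` is the quantity that xtreg reports as rho.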

QUESTION 2

The output for this question is produced using the file PS2_q2_0708.do. First of all, I prepare and clean the original data following the instructions in the text of the problem.
****************************************************************************
************************************************
**********   CLEAN AND PREPARE DATA   **********
************************************************
use abdata.dta, clear
keep if year>=1977&year<=1982
(193 observations deleted)
sort id year
egen smpl=rowmiss(id year n w k ys)
drop if smpl>0
(0 observations deleted)
by id: g N=_N
keep if N==6
(10 observations deleted)
egen wave=group(year)
keep id year wave n w k ys
order id year wave n w k ys
sort id wave
save abdatanew.dta, replace
****************************************************************************
a. Now, run the two models:
****************************************************************************
tsset id wave
       panel variable:  id (strongly balanced)
        time variable:  wave, 1 to 6
*RUN THE HAUSMAN TEST USING STATA'S ROUTINE
xtreg n w k ys, re

Random-effects GLS regression          Number of obs      = 828
Group variable (i): id                 Number of groups   = 138
R-sq:  within  =                       Obs per group: min = 6
       between =                                      avg = 6.0
       overall =                                      max = 6
Random effects u_i ~ Gaussian          Wald chi2(3)       =
corr(u_i, X)   = 0 (assumed)           Prob > chi2        =

           n |   Coef.   Std. Err.   z   P>|z|   [95% Conf. Interval]
           w |
           k |
          ys |
       _cons |
     sigma_u |
     sigma_e |
         rho |   (fraction of variance due to u_i)
est store random
xtreg n w k ys, fe

Fixed-effects (within) regression      Number of obs      = 828
Group variable (i): id                 Number of groups   = 138
R-sq:  within  =                       Obs per group: min = 6
       between =                                      avg = 6.0
       overall =                                      max = 6
                                       F(3,687)           =
corr(u_i, Xb)  =                       Prob > F           =

           n |   Coef.   Std. Err.   t   P>|t|   [95% Conf. Interval]
           w |
           k |
          ys |
       _cons |
     sigma_u |
     sigma_e |
         rho |   (fraction of variance due to u_i)
F test that all u_i=0:   F(137, 687) =          Prob > F =
est store fixed
****************************************************************************
b. And perform the Hausman test using Stata's routine. Pay attention to the order in which you pass the estimators to hausman: first the consistent one, then the efficient one.
****************************************************************************
********************

*** HAUSMAN TEST ***
********************
                 ---- Coefficients ----
             |   (b)     (B)      (b-B)       sqrt(diag(V_b-V_B))
             |  fixed   random   Difference         S.E.
           w |
           k |
          ys |
b = consistent under Ho and Ha; obtained from xtreg
B = inconsistent under Ha, efficient under Ho; obtained from xtreg
Test: Ho: difference in coefficients not systematic
      chi2(3) = (b-B)'[(V_b-V_B)^(-1)](b-B)
              = 73.80
    Prob>chi2 =
(V_b-V_B is not positive definite)
*********************************************************************
You might get a message like the one at the bottom of this output, warning you that the matrix V_b-V_B is not positive definite. This may occur occasionally because hausman uses two different estimates of $\sigma_\nu$ to compute V_b and V_B (namely, the estimates obtained from the fixed- and the random-effects models respectively). Although asymptotically they converge to the same true value, they can differ in small samples and lead to a V_b-V_B that is not positive definite. You can solve the problem by forcing hausman to use the same estimate of $\sigma_\nu$ for both variance-covariance matrices (use the option sigmaless or sigmamore for this). To know more, look up in the manuals the description of how xtreg computes the variance components under the different models. Also, take a look at this user discussion about this topic:

You can replicate the results by saving the estimated coefficients and variance-covariance matrices of the two models and computing the test manually.
*********************************************************************
*REDO THE TEST MANUALLY
est restore random
(results random are active now)
matrix RE=e(b)'
matrix RE=RE[1..3,1]
matrix Vre=e(V)
matrix Vre=Vre[1..3,1..3]
est restore fixed
(results fixed are active now)

est store WG
matrix WG=e(b)'
matrix WG=WG[1..3,1]
matrix Vwg=e(V)
matrix Vwg=Vwg[1..3,1..3]
matrix h1=(RE-WG)'*inv(Vwg-Vre)*(RE-WG)
matrix list h1
symmetric h1[1,1]
        y1
y1
*********************************************************************
c. What happens if we use the alternative version? This we can only do manually.
*********************************************************************
* TRY OTHER VERSION OF THE TEST
xtreg n w k ys, be

Between regression (regression on group means)   Number of obs      = 828
Group variable (i): id                           Number of groups   = 138
R-sq:  within  =                                 Obs per group: min = 6
       between =                                                avg = 6.0
       overall =                                                max = 6
                                                 F(3,134)           =
sd(u_i + avg(e_i)) =                             Prob > F           =

           n |   Coef.   Std. Err.   t   P>|t|   [95% Conf. Interval]
           w |
           k |
          ys |
       _cons |
matrix BG=e(b)'
matrix BG=BG[1..3,1]
matrix Vbg=e(V)
matrix Vbg=Vbg[1..3,1..3]
matrix h2=(BG-WG)'*inv(Vwg+Vbg)*(BG-WG)
matrix list h2

symmetric h2[1,1]
        y1
y1
*********************************************************************
We get a different result, although the two statistics are asymptotically identical, because the identity proved by Hausman & Taylor (1981) in Econometrica holds only asymptotically. In small samples the two tests can differ because of the same problem discussed above (the different estimates of $\sigma_\nu$). In fact, if you force hausman to use the estimate of $\sigma_\nu$ from the fixed-effects model, you get the same result:
*********************************************************************
hausman fixed random, sigmaless
                 ---- Coefficients ----
             |   (b)     (B)      (b-B)       sqrt(diag(V_b-V_B))
             |  fixed   random   Difference         S.E.
           w |
           k |
          ys |
b = consistent under Ho and Ha; obtained from xtreg
B = inconsistent under Ha, efficient under Ho; obtained from xtreg
Test: Ho: difference in coefficients not systematic
      chi2(3) = (b-B)'[(V_b-V_B)^(-1)](b-B)
              = 66.40
    Prob>chi2 =
*********************************************************************
For your curiosity, you can also construct these tests manually using Mata, Stata's matrix language. Here is how to do it (only the alternative version of the test is computed here):
*********************************************************************
******* REDO THIS LAST TEST WITH MATA
mata
------------------------- mata (type end to exit) -------------------------
:
: BG=st_matrix("BG")
: Vbg=st_matrix("Vbg")
: WG=st_matrix("WG")
: Vwg=st_matrix("Vwg")
:
: (BG-WG)'*invsym(Vwg+Vbg)*(BG-WG)

:
: end
---------------------------------------------------------------------------
*********************************************************************
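Both manual versions of the test are the same kind of quadratic form, d' M^(-1) d, where d is a difference of coefficient vectors and M a difference (or sum) of covariance matrices. The Python sketch below illustrates that computation for two coefficients only; the function name `hausman_2x2` and all the numbers are hypothetical and are not the estimates from this problem set. It also makes explicit the positive-definiteness check behind Stata's warning.

```python
def hausman_2x2(b_cons, b_eff, V_cons, V_eff):
    """Hausman statistic d' (V_cons - V_eff)^(-1) d for two coefficients."""
    d = [b_cons[0] - b_eff[0], b_cons[1] - b_eff[1]]
    # Element-by-element difference of the two covariance matrices.
    (a, b), (c, e) = [[V_cons[i][j] - V_eff[i][j] for j in range(2)]
                      for i in range(2)]
    det = a * e - b * c
    # A symmetric 2x2 matrix is positive definite iff a > 0 and det > 0.
    if a <= 0 or det <= 0:
        raise ValueError("V_cons - V_eff is not positive definite")
    # Quadratic form computed with the closed-form 2x2 inverse.
    return (e * d[0] ** 2 - (b + c) * d[0] * d[1] + a * d[1] ** 2) / det

# Hypothetical inputs: consistent (fixed-effects) estimates first,
# efficient (random-effects) estimates second, as in hausman.
h = hausman_2x2(
    b_cons=[1.0, 2.0],
    b_eff=[0.5, 1.0],
    V_cons=[[0.5, 0.0], [0.0, 1.0]],
    V_eff=[[0.25, 0.0], [0.0, 0.5]],
)
print(h)
```

Swapping the two covariance arguments in this example makes the difference negative definite and triggers the error, which is the situation the note "(V_b-V_B is not positive definite)" flags.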

QUESTION 3

First of all, to make sure you have all the software pieces needed, you may want to update your Stata and download the necessary user-written files:
*********************************************************************
(output omitted)
update all
search xtabond2, all
net install xtabond2.pkg, replace
search ivreg2, all
net sj 5-4 st0030_2
net install st0030_2.pkg, replace
*********************************************************************
a. Select only the first three years of data and run difference GMM using xtabond2.
*********************************************************************
use abdatanew, clear
keep if wave<=3
(414 observations deleted)
tsset id wave
       panel variable:  id (strongly balanced)
        time variable:  wave, 1 to 3
* once the data have been tsset we can use the lag (l.) and the difference (d.) operators
******
* 1 DIFFERENCE GMM WITH T=3
*****
xtabond2 n l.n, gmm(l.n) noleveleq
Favoring space over speed. To switch, type or click on mata: mata set matafavor speed, perm.

Dynamic panel-data estimation, one-step difference GMM
Group variable: id                     Number of obs      = 138
Time variable : wave                   Number of groups   = 138
Number of instruments = 1              Obs per group: min = 1
Wald chi2(1)  = 0.01                                  avg = 1.00
Prob > chi2   = 0.940                                 max = 1

             |   Coef.   Std. Err.   z   P>|z|   [95% Conf. Interval]
           n |

         L1. |
Arellano-Bond test for AR(1) in first differences: z =      Pr > z =
Arellano-Bond test for AR(2) in first differences: z =      Pr > z =
Sargan test of overid. restrictions: chi2(0) = 0.00   Prob > chi2 =
  (Not robust, but not weakened by many instruments.)
*********************************************************************
The model in first differences has only one observation per firm, so there is no test for autocorrelation in the disturbances (either first or second order). Moreover, there is only one instrument, so the model is exactly identified and there is no test of over-identifying restrictions. The syntax of xtabond is slightly different. The same model can be obtained as follows:
*********************************************************************
xtabond n, noc
note: the residuals and the L(1) residuals have no obs in common
      The AR(1) is trivially zero
note: the residuals and the L(2) residuals have no obs in common
      The AR(2) is trivially zero

Arellano-Bond dynamic panel-data estimation   Number of obs      = 138
Group variable (i): id                        Number of groups   = 138
                                              Wald chi2()        =
Time variable (t): wave                       Obs per group: min = 1
                                                             avg = 1
                                                             max = 1
One-step results
         D.n |   Coef.   Std. Err.   z   P>|z|   [95% Conf. Interval]
           n |
         LD. |
Sargan test of over-identifying restrictions:
         chi2(0) = 0.00      Prob > chi2 =
Arellano-Bond test that average autocovariance in residuals of order 1 is 0:
         H0: no autocorrelation   z =      Pr > z =
Arellano-Bond test that average autocovariance in residuals of order 2 is 0:
         H0: no autocorrelation   z =      Pr > z =
*********************************************************************
The point estimates are identical, but the standard errors are slightly different because xtabond reports by default the small-sample version of the estimator's variance (indeed, you can force xtabond2 to use small-sample adjustments with the small option).

b. The same results can also be replicated with simple ivreg by taking the model in first differences and instrumenting the lagged difference with the dependent variable lagged twice.
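With T = 3 and no constant, the exactly identified IV estimator has a closed form: the sum over firms of (twice-lagged level times current difference) divided by the sum of (twice-lagged level times lagged difference). The Python sketch below illustrates this on a hypothetical, noise-free panel; the function names and the numbers are made up for the illustration and have nothing to do with the abdata sample. Because the simulated series contain a firm effect but no idiosyncratic shock, the estimator recovers the autoregressive coefficient exactly, which shows how first-differencing removes the fixed effect.

```python
def diff_iv_estimate(panel):
    """Just-identified IV on the differenced model, T = 3, no constant.

    panel: list of (y1, y2, y3) tuples, one per unit.
    Instrument: y1 (the level lagged twice relative to wave 3).
    """
    num = sum(y1 * (y3 - y2) for y1, y2, y3 in panel)  # sum of z_i * D.y_i3
    den = sum(y1 * (y2 - y1) for y1, y2, y3 in panel)  # sum of z_i * D.y_i2
    return num / den

def simulate_unit(y1, eta, alpha):
    """Noise-free AR(1) with a unit-specific effect eta (hypothetical data)."""
    y2 = alpha * y1 + eta
    y3 = alpha * y2 + eta
    return (y1, y2, y3)

true_alpha = 0.5
panel = [simulate_unit(y1, eta, true_alpha)
         for y1, eta in [(1.0, 2.0), (4.0, -1.0), (2.0, 3.0)]]
alpha_hat = diff_iv_estimate(panel)
print(alpha_hat)
```

With real data the differences also contain the differenced shocks, so the estimate is noisy and its precision depends on how strongly the lagged level predicts the lagged difference.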

*********************************************************************
ivreg2 d.n (ld.n=l2.n), noc

Instrumental variables (2SLS) regression
                                       Number of obs   = 138
                                       F(  1,   137)   = 0.01
                                       Prob > F        =
Total (centered) SS     =              Centered R2     =
Total (uncentered) SS   =              Uncentered R2   =
Residual SS             =              Root MSE        = .0789

         D.n |   Coef.   Std. Err.   z   P>|z|   [95% Conf. Interval]
           n |
         LD. |
Anderson canon. corr. LR statistic (identification/IV relevance test): 3.343
                                                   Chi-sq(1) P-val =
Sargan statistic (overidentification test of all instruments):        0.000
                                             (equation exactly identified)
Instrumented:         LD.n
Excluded instruments: L2.n
*********************************************************************
c. Now, repeat difference GMM with all available time periods. Below I do it both with xtabond and xtabond2.
*********************************************************************
use abdatanew, clear
xtabond2 n l.n, gmm(l.n) noleveleq
Favoring space over speed. To switch, type or click on mata: mata set matafavor speed, perm.

Dynamic panel-data estimation, one-step difference GMM
Group variable: id                     Number of obs      = 552
Time variable : wave                   Number of groups   = 138
Number of instruments = 10             Obs per group: min = 4
Wald chi2(1)  =                                       avg = 4.00
Prob > chi2   = 0.000                                 max = 4

             |   Coef.   Std. Err.   z   P>|z|   [95% Conf. Interval]
           n |
         L1. |
Arellano-Bond test for AR(1) in first differences: z = -4.83   Pr > z = 0.000
Arellano-Bond test for AR(2) in first differences: z = -2.75   Pr > z = 0.006
Sargan test of overid. restrictions: chi2(9) =        Prob > chi2 = 0.000

  (Not robust, but not weakened by many instruments.)
xtabond n, noc

Arellano-Bond dynamic panel-data estimation   Number of obs      = 552
Group variable (i): id                        Number of groups   = 138
                                              Wald chi2()        =
Time variable (t): wave                       Obs per group: min = 4
                                                             avg = 4
                                                             max = 4
One-step results
         D.n |   Coef.   Std. Err.   z   P>|z|   [95% Conf. Interval]
           n |
         LD. |
Sargan test of over-identifying restrictions:
         chi2(9) =           Prob > chi2 =
Arellano-Bond test that average autocovariance in residuals of order 1 is 0:
         H0: no autocorrelation   z = -4.83   Pr > z =
Arellano-Bond test that average autocovariance in residuals of order 2 is 0:
         H0: no autocorrelation   z = -2.75   Pr > z =
*********************************************************************
The instruments in levels (for the model in differences) that we are using now require the errors to be uncorrelated over time within individuals. If there is some autocorrelation (as the tests suggest there is, on average), then we should drop the instruments that are too close (for example, start using the level at the second or third lag). The cost of dropping the closest instruments, however, is that these are the strongest ones in terms of predictive power for the endogenous variable. Recall that, if we had some strictly exogenous or predetermined variables, we could use them as instruments for the endogenous lagged dependent variable. These conditions do not require the errors to be uncorrelated over time.

d. With ivreg we cannot reproduce the same estimates. Let's first do the best we can with ivreg. It is important to be careful in how we construct the matrix of instruments:
*********************************************************************
use abdatanew, clear
tsset id wave
       panel variable:  id (strongly balanced)
        time variable:  wave, 1 to 6
forvalues yr=2/6 {
    local m=`yr'-1
    forvalues lag=2/`m' {
        quietly generate z`yr'l`lag'=l`lag'.n if wave==`yr'
    }

}
quietly recode z* (.=0)
g d_n=d.n
(138 missing values generated)
g dl_n=dl.n
(276 missing values generated)
order id wave n d_n dl_n z*
ivreg2 d.n (ld.n=z*), noc

Instrumental variables (2SLS) regression
                                       Number of obs   = 552
                                       F(  1,   551)   = 53.04
                                       Prob > F        =
Total (centered) SS     =              Centered R2     =
Total (uncentered) SS   =              Uncentered R2   =
Residual SS             =              Root MSE        = .1405

         D.n |   Coef.   Std. Err.   z   P>|z|   [95% Conf. Interval]
           n |
         LD. |
Anderson canon. corr. LR statistic (identification/IV relevance test):
                                                  Chi-sq(10) P-val =
Sargan statistic (overidentification test of all instruments):
                                                   Chi-sq(9) P-val =
Instrumented:         LD.n
Excluded instruments: z3l2 z4l2 z4l3 z5l2 z5l3 z5l4 z6l2 z6l3 z6l4 z6l5
*********************************************************************
These estimates differ from the ones produced by xtabond2 because of the weighting matrix. ivreg2 runs plain 2SLS, which weights the moment conditions as if the errors of the differenced equation were spherical. One-step xtabond2 instead builds its default weighting matrix to account for the serial correlation that first-differencing induces in the errors, which is the efficient choice when the errors in levels are homoskedastic and serially uncorrelated. This is the only true difference: both estimators are consistent, so the ivreg and the xtabond2 estimates are asymptotically indistinguishable. To check this, we can force xtabond2 to use the same weighting as ivreg and thus produce the same results. The xtabond2 option that controls the weighting matrix is h( ); h(1) sets it to the identity.
*********************************************************************
xtabond2 n l.n, gmm(l.n) noleveleq h(1)
Favoring space over speed. To switch, type or click on mata: mata set matafavor speed, perm.

Dynamic panel-data estimation, one-step difference GMM
Group variable: id                     Number of obs      = 552

Time variable : wave                   Number of groups   = 138
Number of instruments = 10             Obs per group: min = 4
Wald chi2(1)  = 53.14                                 avg = 4.00
Prob > chi2   = 0.000                                 max = 4

             |   Coef.   Std. Err.   z   P>|z|   [95% Conf. Interval]
           n |
         L1. |
Arellano-Bond test for AR(1) in first differences: z = -1.96   Pr > z = 0.050
Arellano-Bond test for AR(2) in first differences: z = -2.11   Pr > z = 0.035
Sargan test of overid. restrictions: chi2(9) = 59.20   Prob > chi2 = 0.000
  (Not robust, but not weakened by many instruments.)
*********************************************************************
e. System GMM is only available with xtabond2. Here is how it can be produced (it is actually the default in xtabond2).
*********************************************************************
use abdatanew, clear
xtabond2 n l.n, gmm(l.n) noc
Favoring space over speed. To switch, type or click on mata: mata set matafavor speed, perm.

Dynamic panel-data estimation, one-step system GMM
Group variable: id                     Number of obs      = 690
Time variable : wave                   Number of groups   = 138
Number of instruments = 14             Obs per group: min = 5
Wald chi2(1)  =                                       avg = 5.00
Prob > chi2   = 0.000                                 max = 5

             |   Coef.   Std. Err.   z   P>|z|   [95% Conf. Interval]
           n |
         L1. |
Arellano-Bond test for AR(1) in first differences: z = -6.72   Pr > z = 0.000
Arellano-Bond test for AR(2) in first differences: z = -2.61   Pr > z = 0.009
Sargan test of overid. restrictions: chi2(13) =        Prob > chi2 = 0.000
  (Not robust, but not weakened by many instruments.)
Difference-in-Sargan tests of exogeneity of instrument subsets:
  GMM instruments for levels
    Sargan test excluding group:     chi2(9) =        Prob > chi2 = 0.000
    Difference (null H = exogenous): chi2(4) =        Prob > chi2 = 0.000
*********************************************************************
f. As in question d), we cannot replicate the results with ivreg.
*********************************************************************
g model="diff"

g lhs=d.n
(138 missing values generated)
g rhs=ld.n
(276 missing values generated)
order model id wave n d_n lhs rhs z*
keep if wave>2
(276 observations deleted)
keep model id wave n d_n lhs rhs z*
sort id wave
save dmodel.dta, replace
file dmodel.dta saved
use abdatanew.dta, clear
forvalues yr=2/6 {
    local m=`yr'-2
    forvalues lag=0/`m' {
        quietly generate Dz`yr'L`lag'=l`lag'.d1.n if wave==`yr'
    }
}
quietly recode Dz* (.=0)
keep id wave n Dz*
g lhs=n
g rhs=l.lhs
(138 missing values generated)
g model="level"
g d_n=d.n
(138 missing values generated)
order model id wave n d_n lhs rhs Dz*
keep if wave>1
(138 observations deleted)
append using dmodel.dta
keep id wave n d_n lhs rhs z* Dz* model
order model id wave n d_n lhs rhs z* Dz*
sort model id wave
quietly recode z* Dz* (.=0)

ivreg2 lhs (rhs=z* Dz*L1), noc

Instrumental variables (2SLS) regression
                                       Number of obs   = 1242
                                       F(  1,  1241)   =
                                       Prob > F        =
Total (centered) SS     =              Centered R2     =
Total (uncentered) SS   =              Uncentered R2   =
Residual SS             =              Root MSE        = .2016

         lhs |   Coef.   Std. Err.   z   P>|z|   [95% Conf. Interval]
         rhs |
Anderson canon. corr. LR statistic (identification/IV relevance test):
                                                  Chi-sq(14) P-val =
Sargan statistic (overidentification test of all instruments):
                                                  Chi-sq(13) P-val =
Instrumented:         rhs
Excluded instruments: z3l2 z4l2 z4l3 z5l2 z5l3 z5l4 z6l2 z6l3 z6l4 z6l5
                      Dz3L1 Dz4L1 Dz5L1 Dz6L1
*********************************************************************
but we can force xtabond2 to assume homoskedasticity:
*********************************************************************
use abdatanew.dta, clear
xtabond2 n l.n, gmm(l.n) h(1) noc
Favoring space over speed. To switch, type or click on mata: mata set matafavor speed, perm.

Dynamic panel-data estimation, one-step system GMM
Group variable: id                     Number of obs      = 690
Time variable : wave                   Number of groups   = 138
Number of instruments = 14             Obs per group: min = 5
Wald chi2(1)  =                                       avg = 5.00
Prob > chi2   = 0.000                                 max = 5

             |   Coef.   Std. Err.   z   P>|z|   [95% Conf. Interval]
           n |
         L1. |
Arellano-Bond test for AR(1) in first differences: z = -3.54   Pr > z = 0.000
Arellano-Bond test for AR(2) in first differences: z = -1.47   Pr > z = 0.143
Sargan test of overid. restrictions: chi2(13) = 48.74   Prob > chi2 = 0.000
  (Not robust, but not weakened by many instruments.)
Difference-in-Sargan tests of exogeneity of instrument subsets:
  GMM instruments for levels
    Sargan test excluding group:     chi2(9) = 20.92   Prob > chi2 = 0.013

    Difference (null H = exogenous): chi2(4) = 27.82   Prob > chi2 = 0.000
*********************************************************************
Then we get the same point estimates, though not the same standard errors. In fact, in this particular case, even under homoskedasticity the correct variance-covariance matrix of the errors in the stacked IV model would not be a simple identity matrix, because the equations in levels and in differences are obviously correlated.
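The instrument sets built by the two forvalues loops in parts (d) and (f) can be enumerated to confirm the instrument counts reported by xtabond2 (10 for difference GMM, 14 for system GMM). The Python sketch below only reproduces the naming logic of those loops; the function names are hypothetical and nothing here touches the data.

```python
def difference_instruments(T):
    """Names z{t}l{lag}: the level of n lagged `lag` periods, nonzero only in wave t.

    Mirrors the Stata loop: for each wave t, lags run from 2 to t-1,
    so wave 2 contributes no instrument.
    """
    return [f"z{t}l{lag}" for t in range(2, T + 1) for lag in range(2, t)]

def level_instruments(T):
    """Names Dz{t}L1: the lagged first difference of n, nonzero only in wave t.

    These are the columns selected by Dz*L1 in the ivreg2 call.
    """
    return [f"Dz{t}L1" for t in range(3, T + 1)]

T = 6  # number of waves kept in abdatanew
diff_set = difference_instruments(T)
level_set = level_instruments(T)
print(len(diff_set), diff_set)
print(len(diff_set) + len(level_set), level_set)
```

Each name corresponds to one column of the instrument matrix, set to zero outside its own wave, which is why the recode of missing values to zero is needed before calling ivreg2.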
