Seminar in Economic Analysis III (LECON2486)
Human capital, development and migration: a macroeconomic approach (Empirical Analysis - Static Part)
Frédéric Docquier & Sara Salomone, IRES, UCLouvain
February-June 2015
STATA
To carry out your empirical analysis you are expected to be at least a STATA beginner, which means:
- Having access to STATA (the commands used in this course are from STATA 11)
- Being able to open it and recognize its windows and icons
- Being familiar with basic commands (such as list, des, sum, gen...)
- Being able to use and save a dataset
- Knowing how to prepare a do-file
If a command is unknown or unclear, do not hesitate to ask STATA itself with the help command, or me (sara.salomone@uclouvain.be).
The dataset
The dataset at your disposal, a .dta file, contains panel and cross-sectional variables. Panel variables (migration stocks, for example) have two dimensions:
1. Geography (195 countries), where i refers to each country
2. Time (9 periods, every 5 years from 1970 to 2010), where t refers to the time dimension
Cross-sectional variables have only dimension 1: they are constant over time for each country (the proportion of religious groups, for example).
The panel dataset
Advantages of panel data with respect to cross-sectional data:
- You can identify dynamic aspects
- You can control for unobserved heterogeneity that is constant over time
- You can address endogeneity issues also with internal instruments (lagged values of the independent variables)
Data Editor in STATA
Do file in STATA
Panel Data in STATA
The tsset command declares the data in memory to be a panel and lets you use the time-series lag operator L. In the dataset at your disposal, 1 time lag corresponds to 5 years, since you have data for 1970, 1975, 1980, 1985, 1990, 1995, 2000, 2005 and 2010.
STEPS TO ORGANIZE THE PANEL DATA IN STATA:
    egen year_idbis = group(year)     (numbers the periods 1 to 9, so that STATA treats a five-year span as 1 lag)
    tsset newcountry year_idbis       (declares the data in memory to be a panel)
    gen lagvarname = l.varname        (creates a 1-period lagged variable; varname stands for one variable)
    gen lag2varname = l2.varname      (creates a 2-period lagged variable)
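The steps above can be tried on a small made-up panel before touching the course dataset. This is only a sketch: the two countries, the variable mig and its values are invented for illustration, and the underscore in year_idbis is assumed.

```stata
* Toy panel: 2 hypothetical countries observed in 1970, 1975 and 1980.
clear
input str3 country year mig
"AAA" 1970 10
"AAA" 1975 12
"AAA" 1980 15
"BBB" 1970 20
"BBB" 1975 18
"BBB" 1980 25
end

* tsset needs a numeric panel identifier, so encode the country name.
egen newcountry = group(country)
* Consecutive period index (1, 2, 3) so that one lag equals one five-year step.
egen year_idbis = group(year)

* Declare the panel structure.
tsset newcountry year_idbis

* First and second lags of the toy variable.
gen lagmig  = l.mig
gen lag2mig = l2.mig

list country year mig lagmig lag2mig, sepby(country)
```

The lags are missing in the first period(s) of each country, because no earlier observation exists. An alternative that keeps the calendar year as the time variable would be tsset newcountry year, delta(5).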
Panel Data description in STATA
- To see whether the panel is balanced or unbalanced: xtdescribe
- To get an idea of the overall (total) variability: xtsum varlist
- To decompose the total variability into the within (intra-individual) variability $\sigma_e$ and the between (inter-individual) variability $\sigma_u$: xtreg varlist, fe
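As an illustration of these description commands, the sketch below uses an invented two-unit panel in which the second unit never changes over time, so all of its variability is between units. The variable names id, t and y are hypothetical.

```stata
* Toy panel: unit 1 varies over time, unit 2 is constant.
clear
input id t y
1 1 2
1 2 4
1 3 6
2 1 10
2 2 10
2 3 10
end

* Declare the panel (xtset is the panel counterpart of tsset).
xtset id t

* Pattern of observations: shows whether the panel is balanced.
xtdescribe

* Overall, between and within standard deviations of y.
xtsum y
```

In the xtsum table, the "within" row only picks up the movement of unit 1 around its own mean, while the "between" row compares the two unit means.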
The model
The causal relationship between y and z can be identified through two models, to which instrumental variable techniques are applied:
1. STATIC MODEL:
   $$y_{i,t} = \alpha_0 + \alpha_1 \underbrace{z_{i,t}}_{\text{endogenous}} + \gamma x_{i,t} + \varepsilon_{i,t} \qquad (1)$$
2. DYNAMIC MODEL:
   $$y_{i,t} = \alpha_0 + \alpha_1 \underbrace{z_{i,t}}_{\text{endogenous}} + \beta \underbrace{y_{i,t-1}}_{\text{endogenous}} + \gamma x_{i,t} + \varepsilon_{i,t} \qquad (2)$$
In this course we focus only on static models with endogeneity.
Endogeneity
$z_{i,t}$ is presumably endogenous (i.e. correlated with $\varepsilon_{i,t}$) for at least one of the following reasons:
1. Reverse causality or economic simultaneity: $z_{i,t}$ is generated inside the same economic system that also generates $y_{i,t}$
2. Omitted variable: the original model is misspecified because at least one explanatory variable is missing
3. Measurement error: $z_{i,t}$ is measured with error
If this is the case, the estimate of $\alpha_1$ is biased and inconsistent, so an instrumentation strategy needs to be implemented.
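To see where the bias comes from, consider a simplified version of the static model without the controls $x_{i,t}$ (an assumption made here only to keep the algebra short). In that case the OLS estimator converges to:

```latex
\[
\operatorname{plim}\, \hat{\alpha}_1^{OLS}
  = \frac{\operatorname{Cov}(z_{i,t},\, y_{i,t})}{\operatorname{Var}(z_{i,t})}
  = \alpha_1 + \frac{\operatorname{Cov}(z_{i,t},\, \varepsilon_{i,t})}{\operatorname{Var}(z_{i,t})}
\]
```

The second term is zero only when $z_{i,t}$ is uncorrelated with $\varepsilon_{i,t}$. Under any of the three sources of endogeneity it does not vanish, however large the sample.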
Instruments validity
An instrument is an external variable (equal neither to $z_{i,t}$ nor to $x_{i,t}$) or an internal one (lagged values of $z_{i,t}$) which has to be:
1. Relevant: correlated with the endogenous variable $z_{i,t}$
2. Exogenous: uncorrelated with the error term $\varepsilon_{i,t}$
Unfortunately, exogeneity cannot be tested directly. Relevance, however, can be assessed with first-stage results and identification statistics.
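The role of the two conditions can be read off the simplest IV estimator. The sketch below assumes a single instrument, written $w_{i,t}$ (a symbol introduced here, not in the slides), and again leaves out the controls $x_{i,t}$:

```latex
\[
\hat{\alpha}_1^{IV}
  = \frac{\widehat{\operatorname{Cov}}(w_{i,t},\, y_{i,t})}{\widehat{\operatorname{Cov}}(w_{i,t},\, z_{i,t})},
\qquad
\operatorname{plim}\, \hat{\alpha}_1^{IV}
  = \alpha_1 + \frac{\operatorname{Cov}(w_{i,t},\, \varepsilon_{i,t})}{\operatorname{Cov}(w_{i,t},\, z_{i,t})}
\]
```

Relevance guarantees that the denominator is different from zero, and exogeneity sets the numerator of the last term to zero. When the instrument is only weakly relevant, the denominator is close to zero and even a small violation of exogeneity is magnified.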
Instrumentation tests
An instrumentation strategy is supported if:
- The first-stage F-statistic is higher than 10 (as a rule of thumb)
- The weak identification F-statistic exceeds the Stock-Yogo critical value (16.38 for one endogenous regressor, one instrument and a 10% maximal IV size)
- The Hausman (endogeneity) test rejects $H_0$: $z_{i,t}$ is exogenous, which confirms that instrumenting is needed
- If more than one instrument is used, the Hansen J-test does not reject $H_0$: the over-identifying restrictions are valid
Static Model: $y_{i,t} = \alpha_0 + \alpha_1 z_{i,t} + \gamma x_{i,t} + \varepsilon_{i,t}$
It captures the long-run causal relationship between $z_{i,t}$ and $y_{i,t}$, where $z_{i,t}$ needs to be instrumented if it is endogenous.
ESTIMATION TECHNIQUES:
1. Pooled OLS (POLS)
2. Fixed effect (FE) estimation
3. Random effect (RE) estimation
POLS properties
Estimation commands:
    regress y z X                                    (without instruments)
    ivreg2 y (z = instr) X, robust ffirst endog(z)   (with instruments)
where robust is the heteroskedasticity correction, ffirst reports the instrument tests and endog(z) performs the Hausman test.
This standard multiple regression model implies:
1. A homogeneous behaviour of different countries in both slope and intercept. This can be checked graphically:
    twoway (scatter y z) (lfit y z)
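The difference between the two commands can be seen on simulated data where the true coefficient is known. This is a sketch under stated assumptions: the data-generating process, the confounder u and the true value 2 for the coefficient on z are all invented, and ivreg2 is a user-written command that has to be installed first.

```stata
* Requires the user-written command: ssc install ivreg2
clear
set seed 12345
set obs 1000

gen instr = rnormal()                     // instrument: moves z, unrelated to the error
gen u     = rnormal()                     // unobserved confounder (source of endogeneity)
gen X     = rnormal()                     // exogenous control
gen z     = 0.8*instr + u + rnormal()     // z depends on the instrument and on u
gen y     = 1 + 2*z + 0.5*X - u + rnormal()   // true alpha_1 = 2 by construction

* POLS without instruments: the coefficient on z is pulled away from 2.
regress y z X

* IV with heteroskedasticity-robust errors, first-stage and endogeneity tests.
ivreg2 y (z = instr) X, robust ffirst endog(z)
```

Comparing the coefficient on z across the two tables shows the direction of the OLS bias. The ffirst output is where the first-stage F-statistic appears.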
POLS properties
2. Homoskedasticity (the error variance is constant). The Breusch-Pagan test and a graphical analysis can detect violations:
    quietly regress y z X           (replicates the estimation without showing the table)
    predict yhat                    (predicts and stores the fitted values of y)
    predict residuals, res          (predicts and stores the residuals)
    estat hettest                   ($H_0$: constant variance)
    twoway scatter residuals yhat   (a random scatter suggests homoskedasticity)
To correct for heteroskedasticity:
    regress y z X, robust
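A quick way to get a feel for these diagnostics is to run them on data that are heteroskedastic by construction. In the sketch below everything is simulated and the names are illustrative: the error standard deviation is made proportional to z, so the test should flag a problem.

```stata
clear
set seed 2015
set obs 500

gen z = runiform()*10
gen X = rnormal()
* Error standard deviation grows with z: heteroskedastic by construction.
gen y = 1 + 0.5*z + 0.3*X + rnormal()*z

quietly regress y z X
predict yhat                  // fitted values
predict residuals, res        // residuals

* Breusch-Pagan test, H0: constant variance.
estat hettest

* Residuals against fitted values: a funnel shape signals heteroskedasticity.
twoway scatter residuals yhat

* Same coefficients, heteroskedasticity-robust standard errors.
regress y z X, robust
```

The robust option changes only the standard errors, not the point estimates, so the two regression tables differ in their t-statistics and confidence intervals.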
POLS properties
3. Errors are serially uncorrelated. To test for serial correlation, perform an LM test: regress the residuals on their past value to see whether the two are correlated:
    gen lagres = l.residuals        (creates the first lag of the residuals)
    regress residuals z X lagres    (if the coefficient $\varphi$ on lagres is significant at 1%, there is autocorrelation)
    testparm lagres                 ($H_0$: $\varphi = 0$)
4. Valid model specification. The model should include all the relevant variables and exclude irrelevant ones:
    quietly regress y z X
    ovtest                          ($H_0$: the model has no omitted variables)
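The LM procedure for serial correlation can be rehearsed on a simulated panel whose errors follow an AR(1) process. This is a sketch with invented names and parameters (50 units, 9 periods, autocorrelation 0.7), and the controls X are left out for brevity.

```stata
clear
set seed 42
set obs 50
gen id = _n
expand 9                                   // 9 periods per unit
bysort id: gen t = _n
tsset id t

gen z = rnormal()
* AR(1) errors within each unit: e_t = 0.7*e_(t-1) + shock.
gen e = rnormal()
bysort id (t): replace e = 0.7*e[_n-1] + rnormal() if _n > 1
gen y = 1 + 0.5*z + e

quietly regress y z
predict residuals, res

* LM test: regress residuals on their own first lag.
gen lagres = l.residuals
regress residuals z lagres
testparm lagres                            // H0: coefficient on lagres = 0
```

Because the errors are autocorrelated by construction, the coefficient on lagres should come out clearly positive and the test should reject.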
From overall to within and between variability
The hypothesis that different countries behave homogeneously in both slope and intercept is rarely tenable:
- Different countries in different historical phases may have followed different policies
- The panel structure, which distinguishes within and between variability (in the individual dimension, the temporal dimension, or both), is fully ignored
- A misleading non-linearity may result from POLS (which amounts to treating the data as one large cross-section of n i.i.d. observations)
FE estimation (one way)
- Fixed effect estimation (one way) exploits the within (intra-country) variability
- The within variability is the deviation of $z_{i,t}$ from its country mean over time; FE measures the effect of this deviation on $y_{i,t}$
- As a result, variables that are constant over time (e.g. language, small island status, colonial links, latitude/longitude) are removed
- The individual effect can be correlated with the $z_{i,t}$ and $X_{i,t}$ variables
Estimation commands:
    xtreg y z X, fe vce(robust)                                    (without instruments; vce(robust) is the heteroskedasticity correction)
    xtivreg y (z = instr1 instr2 instr3) X, fe vce(robust) first   (with instruments; first reports the first-stage results)
Whether xtivreg accepts vce(robust) depends on the STATA version (see help xtivreg).
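The reason time-invariant variables drop out can be made explicit with the within transformation. The sketch below assumes the error of the static model splits into a country effect $\mu_i$ and an idiosyncratic term $v_{i,t}$ (notation introduced here), with bars denoting country means over time:

```latex
\begin{align*}
y_{i,t} &= \alpha_0 + \alpha_1 z_{i,t} + \gamma x_{i,t} + \mu_i + v_{i,t} \\
\bar{y}_{i} &= \alpha_0 + \alpha_1 \bar{z}_{i} + \gamma \bar{x}_{i} + \mu_i + \bar{v}_{i} \\
y_{i,t} - \bar{y}_{i} &= \alpha_1 (z_{i,t} - \bar{z}_{i}) + \gamma (x_{i,t} - \bar{x}_{i}) + (v_{i,t} - \bar{v}_{i})
\end{align*}
```

Subtracting the second line from the first removes $\mu_i$, together with any regressor that does not vary over time. This is why $\mu_i$ may be correlated with the regressors without biasing the FE estimate.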
Help xtreg
FE estimation (one way) and (two ways)
The same FE estimates can be obtained through a least squares dummy variable (LSDV) model:
1. One way model: includes only one set of dummy variables (country)
2. Two ways model: includes two sets of dummy variables (country and year)
To create country and year dummies in STATA:
    egen country_id = group(country)
    xi i.country, prefix(_C)
    egen year_id = group(year)
    xi i.year, prefix(_Y)
LSDV ESTIMATION (two ways):
1. Without instruments:
    reg y z X _C* _Y*, robust       (_C*: country fixed effects; _Y*: year fixed effects)
2. With instruments:
    ivreg2 y (z = instr1 instr2 instr3) X _C* _Y*, robust
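The equivalence between FE and LSDV can be checked numerically. The sketch below does so for the one way model on a simulated panel; the 20 units, the effect mu and the coefficient values are invented, and the dummy prefix _C is an assumption.

```stata
clear
set seed 7
set obs 20
gen country_id = _n
gen mu = rnormal()                    // country effect, correlated with z below
expand 9
bysort country_id: gen year_id = _n
gen z = mu + rnormal()
gen y = 1 + 2*z + mu + rnormal()

* Within (FE) estimator.
xtset country_id year_id
quietly xtreg y z, fe
scalar b_fe = _b[z]

* LSDV: OLS with one dummy per country (one is dropped as the base).
xi i.country_id, prefix(_C)
quietly regress y z _C*

* The two slope estimates should coincide up to rounding.
display "FE slope:   " b_fe
display "LSDV slope: " _b[z]
```

Adding a second set of dummies created with xi i.year_id, prefix(_Y) would turn this into the two ways model.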
RE estimation
- The individual effects are treated as random (a plausible assumption when there are many individuals randomly drawn from a large population and the specific nature of the individual heterogeneity is unknown)
- The individual effect is part of the model's error, so it must be uncorrelated with the $z_{i,t}$ and X variables
Estimation commands:
    xtreg y z X, re theta vce(robust)                                              (without instruments; theta reports θ, which measures heterogeneity)
    xtivreg y (z = instr1 instr2 instr3) X, re theta vce(robust) ffirst endog(z)   (with instruments)
ffirst and endog(z) are documented as ivreg2 options; check help xtivreg for the options available with re in your version.
POLS, FE or RE?
- If θ = 1: RE → FE (maximum heterogeneity)
- If θ = 0: RE → POLS (little heterogeneity)
POLS or RE?
    qui xtreg y z X, re
    xttest0         (Breusch-Pagan test for RE, $H_0$: no individual heterogeneity)
If $H_0$ is rejected, RE estimates should be used instead of POLS.
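The two limiting cases follow from the standard definition of θ for a balanced panel with T periods, where $\sigma_u$ is the between (individual) and $\sigma_e$ the within (idiosyncratic) standard deviation. RE amounts to OLS on quasi-demeaned data:

```latex
\[
\theta = 1 - \sqrt{\frac{\sigma_e^2}{T\sigma_u^2 + \sigma_e^2}},
\qquad
y_{i,t} - \theta\,\bar{y}_{i}
  = \alpha_0 (1 - \theta)
  + \alpha_1 (z_{i,t} - \theta\,\bar{z}_{i})
  + \gamma (x_{i,t} - \theta\,\bar{x}_{i})
  + \text{error}
\]
```

With no individual heterogeneity ($\sigma_u = 0$), θ = 0 and the data are left untouched, which is POLS. When $T\sigma_u^2$ is large relative to $\sigma_e^2$, θ approaches 1 and the transformation becomes the full demeaning used by FE.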
FE or RE?
Weaknesses of FE:
1. It cannot estimate the effect of time-invariant variables
2. The residuals are autocorrelated
Weaknesses of RE:
1. Assumes exogeneity (the individual effects are uncorrelated with the regressors)
2. Heteroskedasticity
Hausman test by hand to select the best option:
    qui xtreg y z X, fe
    est store FE
    qui xtreg y z X, re
    est store RE
    hausman FE RE, sigmamore     ($H_0$: difference in coefficients not systematic)
If $H_0$ is rejected, FE should be preferred to RE.
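The Hausman procedure can be rehearsed on a simulated panel in which the individual effect is correlated with the regressor, so that the RE assumption fails by construction. All names and parameter values below are invented for illustration.

```stata
clear
set seed 99
set obs 100
gen id = _n
gen mu = rnormal()                 // individual effect
expand 9
bysort id: gen t = _n
gen z = mu + rnormal()             // z is correlated with mu: RE assumption violated
gen y = 1 + 2*z + mu + rnormal()
xtset id t

quietly xtreg y z, fe
estimates store FE
quietly xtreg y z, re
estimates store RE

* H0: difference in coefficients not systematic (RE is consistent).
hausman FE RE, sigmamore
```

Replacing the line that generates z with gen z = rnormal() removes the correlation with mu, which gives a case where the test has no reason to reject.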