Robust Standard Errors to spatial and time dependence when neither N nor T are very large. Lucciano Villacorta Gonzales. February 15, 2014.


Abstract

In this article, I study two different approaches (two-way cluster and spatial-time autoregressive models) to consistently estimating the variance of the OLS estimator of panel models that are robust to spatial and time dependence. The speed of convergence of the two-way cluster estimator is $\min(\sqrt{N}, \sqrt{T})$. More importantly, the variance of the cluster estimator is also affected by the two forms of dependence. As a consequence, when T and N are not large enough, inference based on the two-way cluster estimator may be misleading. I also show that the two-way cluster estimator can be expressed in terms of a fully flexible spatial-time autoregressive model. As such, if we have a prior idea of the dependence structure, we can gain a lot in terms of precision and reduce the probability of type I error. Finally, I study the implications of both types of dependence in a state-year panel of wage inequality and the minimum wage in the US. When both types of error dependence are considered in the variance estimator, the marginal effect of the minimum wage on wage inequality is no longer significant.

Keywords: Panel, spatial dependence, time dependence, two-way cluster, spatial and time autoregressive model

JEL codes: C230

* I am deeply thankful to Manuel Arellano and Stephane Bonhomme for their guidance and extensive discussions, which provided me with a better understanding of econometrics. Errors and omissions are exclusively my own responsibility. Support from the European Research Council/ERC grant "Estimation of non linear panel models with unobserved heterogeneity" (grant agreement n ) is gratefully acknowledged. CEMFI, Casado del Alisal 5, E-28014, Madrid, Spain. luccianovillacorta@gmail.com, luccianovillacorta@cemfi.es

1 Introduction

Panel data models have been increasingly used in applied econometrics. They offer the possibility of controlling for some kinds of endogeneity without the need for exogenous instruments, and they allow us to consider richer models, including dynamics, time effects and factor structures. The classical literature on panel data has allowed for time series dependence, but has generally assumed that the observations are cross-sectionally independent. However, there are several settings where cross-sectional independence is not a suitable assumption. Examples are models where the cross-sectional units are a nonrandom sample of states, countries or industries, as these units are likely to be connected through neighboring effects, economic trade, linkages in production, etc. [1]

In setups like this, where the observable and the unobservable variables of the model present time and spatial dependence, the variance of the OLS estimator of the model parameters will be a function of these forms of dependence. [2] Although it is well understood that not taking the error dependence into account invalidates statistical inference [3], computing a variance estimator that is robust to both types of dependence is not common in applied econometrics. In particular, Bertrand et al. (2004), Hoeschle (2007) and Drummond et al. (2009) provide evidence of a wide range of empirical applications where the error term of a model could exhibit one or more forms of dependence which are usually not taken into account or not corrected in the right way. Corrections for the two types of dependence are now available in the literature, but they suffer from a number of limitations.

[1] Examples of these are Barrios et al. (2011) and Watson et al. (2011), among others.
[2] As well as the variances of other estimators such as IV, GMM, etc.
[3] In most situations the correlation is positive; therefore, ignoring the correlation leads to an underestimation of standard errors, dramatically increasing the probability of type I error. Cameron et al. (2011) show examples where standard errors increased more than threefold when both types of error dependence were taken into account.

In this article, I study different approaches to consistently estimating the variance of the OLS estimator that are robust to spatial and time dependence in panels in which neither the sample size of the cross section N nor the sample size of the time series T is large enough. Popular examples are state-year panels, industry-year panels or country-year panels, where

the number of states, industries and countries is limited and the available time series are not long enough. [4]

By far the most common approach put forward in the literature to estimate a variance that is robust to both types of dependence is the estimator proposed by Cameron et al. (2011) and Thompson (2011). This estimator is fully nonparametric and extends the one-way cluster approach of Arellano (1987) to multiway clustering that takes into account error dependence in more than one dimension. In the panel data setup with time and spatial dependence it is possible to apply a two-way clustering procedure, decomposing the total variance into two terms: (i) one that only contains the spatial correlation, which is estimated by grouping the data in time clusters, and (ii) one that only contains the time correlation, which is estimated by grouping the data in spatial clusters. The main advantage of this approach is that it does not require any specification of the variance-covariance matrix, allowing for general forms of dependence. Nevertheless, in order to have an accurate approximation, this variance estimator needs a sufficiently large number of both types of clusters, since its asymptotic properties hold when the number of clusters goes to infinity. Moreover, the rates at which each cluster estimator (spatial cluster and time cluster) converges to its asymptotic normal distribution are $\sqrt{N}$ and $\sqrt{T}$ respectively, instead of the OLS rate of $\sqrt{NT}$. As a consequence, tests based on the two-way clustering estimator perform poorly when N and T are not large enough.

In settings where N and T are not large enough, the gain from imposing some structure on the error dependence could be important. For instance, if we limit the dependence with some mixing assumption, we can consider alternative estimators with better small sample properties than the two-way cluster estimator. For example, it is possible to consider a double Heteroskedasticity and Autocorrelation Consistent (HAC) variance estimator. With a mixing assumption in the time series, we can use the HAC estimator of Newey and West to take into account the time dependence; analogously, with a mixing assumption in the cross section, we can estimate a spatial HAC version as in Conley (1999) or Kelejian and Prucha (2003). Nevertheless, this semiparametric approach still needs a sufficiently large sample in order to perform accurately.

[4] In this kind of setup, where the number of observations is limited, the variance is a first-order concern.
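As a rough illustration of the time-series half of the double HAC idea, the following Python sketch computes a Bartlett-kernel (Newey-West type) long-run variance of a sequence of scores $x_t u_t$. The function name `newey_west_lrv`, the truncation lag and the simulated AR(1) scores are assumptions made for this sketch and are not taken from the paper; the scores are assumed to have mean zero.

```python
import numpy as np

def newey_west_lrv(z, lags):
    """Bartlett-kernel long-run variance of a (T, k) array of mean-zero scores.

    Hypothetical helper for illustration: Gamma_0 plus weighted
    autocovariances up to `lags`, with weights 1 - l / (lags + 1).
    """
    z = np.asarray(z, dtype=float)
    T = z.shape[0]
    lrv = z.T @ z / T                          # Gamma_0 (White-type term)
    for l in range(1, lags + 1):
        gamma = z[l:].T @ z[:-l] / T           # sample autocovariance at lag l
        weight = 1.0 - l / (lags + 1.0)        # Bartlett weight
        lrv += weight * (gamma + gamma.T)
    return lrv

# Illustrative use on simulated AR(1) scores (assumed data, not from the paper).
rng = np.random.default_rng(0)
T = 30
e = rng.normal(size=T)
z = np.empty((T, 1))
z[0, 0] = e[0]
for t in range(1, T):
    z[t, 0] = 0.5 * z[t - 1, 0] + e[t]
lrv_hat = newey_west_lrv(z, lags=3)
```

With `lags=0` the function reduces to the heteroskedasticity-only outer product, which is one way to see what the autocorrelation terms add. A spatial version in the spirit of Conley (1999) would weight cross-sectional pairs by distance instead of time lags.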

Another possibility, which could be more convenient in small samples, is to impose more structure by modeling the dependence in a parametric way. This approach seems to be less common in empirical applications, since specifying a spatial model is not straightforward (it requires a number of specifics, such as a weighting matrix); also, if the model is misspecified, the properties of these procedures are in general unknown. In general, it is well known that there is a trade-off between the flexibility but poor small sample properties of the nonparametric approach and the better small sample behavior but sensitivity to model specification of the parametric one. Nevertheless, in some applications there exists some prior knowledge about the dependence that might be useful in order to compute more precise estimates of the variance.

The purpose of this paper is threefold. First, I study the properties of the two-way clustering estimator, both analytically and numerically. The speed of convergence of the two-way cluster estimator is $\min(\sqrt{N}, \sqrt{T})$. More importantly, the variance of the cluster estimator (the variance of the variance estimator) is also affected by the two forms of dependence. As a consequence, when T and N are not large enough, inference based on the two-way cluster estimator may be misleading.

Second, I make a connection between the cluster estimator and spatial autoregressive (SAR) models. The two-way clustering approach can be expressed in terms of a fully flexible spatial and time autoregressive model. Thus, estimating the OLS variance with a parametric model for the error term can be seen as a particular case of the cluster estimator. As such, if we have a prior about the dependence structure, we can gain a lot in terms of precision and reduce the probability of type I error. In addition, I propose a nonlinear least squares estimation of the spatial and time autoregressive model that is consistent and computationally efficient in the presence of fixed and time effects in panels in which neither the sample size of the cross section N nor the sample size of the time series T is large enough.

Third, I study the implications of error dependence in a US state-year panel of wage inequality and the minimum wage with sample sizes of N = 50 and T = 30. I also use this empirical application to study the small sample behavior of the nonparametric and the parametric approaches. Using Monte Carlo simulations, I find a probability of type I error of around 70% instead of the nominal size of

5% when error dependence is ignored. The two-way cluster approach has a type I error of 16% when the nominal size of the test is 5%. On the other hand, using standard errors which incorporate the dependence in a parametric way attains the nominal size of the test.

The standard error of the OLS estimator of the marginal effect of the effective minimum wage on inequality increases by 200% when I use one-way clustering (either by time or by state) and by 300% when I use a two-way clustering approach, relative to no clustering. Despite this, the estimated marginal effect remains significant at the 5% level. However, when I use the parametric variance to control for the dependence, the marginal effect of the minimum wage on US state inequality loses its significance. Finally, the standard error of the average marginal effect estimated by 2SLS increases by 100% and by 200% when I use the two-way cluster approach and the parametric approach, respectively. Using either the two-way clustering or the parametric variance to account for the dependence, the 2SLS estimate of the marginal effect of the minimum wage on US state inequality is no longer significant.

2 A panel model with spatial and time dependence

Consider a linear panel regression model defined by:

$$y_{it} = x_{it}'\beta + \alpha_i + \delta_t + u_{it} \qquad (2.1)$$

where $i = 1, \dots, N$ indexes individuals, $t = 1, \dots, T$ indexes time, $x_{it}$ is a vector of observable covariates, $\alpha_i$ is an individual fixed effect that is constant over time, $\delta_t$ is a time effect that is common across individuals and $u_{it}$ is an unobservable error component. For simplicity, as in Hansen (2007), I work with a transformation of the variables in order to remove the nuisance parameters $\alpha_i$ and $\delta_t$:

$$y_{it} = x_{it}'\beta + u_{it} \qquad (2.2)$$

where the variables in (2.2) are the variables in (2.1) in deviations from their own individual sample mean and time sample mean.

A1: $E(u_{it} \mid x_{it}) = 0$

The OLS estimator of $\beta$ from equation (2.2) is defined as

$$\hat\beta = \left(\sum_{i}^{N}\sum_{t}^{T} x_{it}x_{it}'\right)^{-1}\left(\sum_{i}^{N}\sum_{t}^{T} x_{it}y_{it}\right)$$

Under assumption 1 and regularity conditions, $d_{NT}(\hat\beta - \beta) \xrightarrow{d} N(0, V)$, where $V$ is the asymptotic variance of the OLS estimator and is equal to $Q^{-1} W Q^{-1}$, with

$$Q = \lim_{d_{NT}\to\infty} \frac{1}{NT}\sum_{i}^{N}\sum_{t}^{T} x_{it}x_{it}' \quad\text{and}\quad W = \lim_{d_{NT}\to\infty}\left[\frac{1}{NT}\, VAR\left(\sum_{i}^{N}\sum_{t}^{T} x_{it}u_{it}\right)\right]$$

Up to this point I have not specified the correlation structure of $u_{it}$. In fact, the speed of convergence of the OLS estimator, $d_{NT}$, and the form of $W$ and $V$ will depend on the correlation structure in $u_{it}$ and $x_{it}$. In the most general case of dependence:

$$W = \frac{1}{NT}\, VAR\left(\sum_{i}^{N}\sum_{t}^{T} x_{it}u_{it}\right) = \frac{1}{NT}\sum_{i=1}^{N} VAR\left(\sum_{t}^{T} x_{it}u_{it}\right) + \frac{2}{NT}\sum_{i=1}^{N-1}\sum_{j=i+1}^{N} COV\left(\sum_{t}^{T} x_{it}u_{it},\ \sum_{t}^{T} x_{jt}u_{jt}\right)$$
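The transformation behind (2.2) and the OLS estimator can be sketched in Python for a balanced panel. This is only an illustration under assumptions of my own: the function names `two_way_demean` and `within_ols`, the simulated data and the parameter values are hypothetical, and adding back the grand mean is the usual balanced-panel version of deviating from both the individual and the time means.

```python
import numpy as np

def two_way_demean(a):
    """Remove unit and time means from a balanced (N, T) or (N, T, k) panel."""
    unit_mean = a.mean(axis=1, keepdims=True)        # mean over t for each i
    time_mean = a.mean(axis=0, keepdims=True)        # mean over i for each t
    grand_mean = a.mean(axis=(0, 1), keepdims=True)  # added back once
    return a - unit_mean - time_mean + grand_mean

def within_ols(y, x):
    """OLS of demeaned y (N, T) on demeaned x (N, T, k); returns beta, residuals."""
    k = x.shape[2]
    y_dm = two_way_demean(y).reshape(-1)
    x_dm = two_way_demean(x).reshape(-1, k)
    beta = np.linalg.solve(x_dm.T @ x_dm, x_dm.T @ y_dm)
    resid = (y_dm - x_dm @ beta).reshape(y.shape)
    return beta, resid

# Simulated state-year style panel (assumed values, for illustration only).
rng = np.random.default_rng(0)
N, T = 50, 30
alpha = rng.normal(size=(N, 1))                      # individual effects
delta = rng.normal(size=(1, T))                      # time effects
x = rng.normal(size=(N, T, 2)) + alpha[..., None]    # regressors correlated with alpha
beta_true = np.array([1.0, -0.5])
y = x @ beta_true + alpha + delta + rng.normal(size=(N, T))
beta_hat, resid = within_ols(y, x)
```

The residuals returned here play the role of $\hat u_{it}$ in the variance estimators discussed in the following sections.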

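A2: $E(u_{it}u_{jt}) \neq 0$, $E(u_{it}u_{is}) \neq 0$, $E(u_{it}u_{js}) = 0$ for $i \neq j$ and $t \neq s$

Assumption 2 implies spatial dependence across individuals in the same period and time dependence for each individual, but rules out correlation between different individuals at different moments in time. Under assumption 2 the expression for $W$ will be the following:

$$W = \frac{1}{NT}\sum_{i=1}^{N} VAR\left(\sum_{t}^{T} x_{it}u_{it}\right) + \frac{2}{NT}\sum_{i=1}^{N-1}\sum_{j=i+1}^{N} COV\left(\sum_{t}^{T} x_{it}u_{it},\ \sum_{t}^{T} x_{jt}u_{jt}\right)$$

By linearity of the covariance, and because assumption 2 sets the cross-period terms to zero,

$$W = \frac{1}{NT}\sum_{i=1}^{N} VAR\left(\sum_{t}^{T} x_{it}u_{it}\right) + \frac{2}{NT}\sum_{t}^{T}\sum_{i=1}^{N-1}\sum_{j=i+1}^{N} COV\left(x_{it}u_{it},\ x_{jt}u_{jt}\right)$$

If we add and subtract $\sum_{i}^{N}\sum_{t}^{T} VAR(x_{it}u_{it})$ we can express $W$ as:

$$W = \frac{1}{NT}\left[\sum_{i=1}^{N} VAR\left(\sum_{t}^{T} x_{it}u_{it}\right) + \sum_{t}^{T} VAR\left(\sum_{i}^{N} x_{it}u_{it}\right) - \sum_{i}^{N}\sum_{t}^{T} VAR(x_{it}u_{it})\right]$$

Let us define two different ways of grouping the data: $U_i = (u_{i1}, u_{i2}, \dots, u_{iT})'$ and $U_t = (u_{1t}, u_{2t}, \dots, u_{Nt})'$. Then

$$W = \frac{1}{NT}\sum_{i=1}^{N} E(X_i'U_iU_i'X_i) + \frac{1}{NT}\sum_{t}^{T} E(X_t'U_tU_t'X_t) - \frac{1}{NT}\sum_{i}^{N}\sum_{t}^{T} VAR(x_{it}u_{it})$$

I have separated $W$ into three parts: a first one that only takes into account the time dependence, a second one that only takes into account the spatial dependence, and a third one that only takes into account the variance of each observation:

$$W = \frac{1}{N}\sum_{i=1}^{N} E(X_i'\Omega_iX_i) + \frac{1}{T}\sum_{t}^{T} E(X_t'\Omega_tX_t) - \frac{1}{NT}\sum_{i}^{N}\sum_{t}^{T} VAR(x_{it}u_{it})$$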

where $\Omega_i = \frac{1}{T}E[U_iU_i' \mid X_i]$ and

$$E(X_i'\Omega_iX_i) = \frac{1}{T}\sum_{t=1}^{T} E(x_{it}x_{it}'u_{it}^2) + \frac{2}{T}\sum_{t=1}^{T-1}\sum_{s=t+1}^{T} E(u_{it}u_{is}x_{it}x_{is}')$$

and $\Omega_t = \frac{1}{N}E[U_tU_t' \mid X_t]$ and

$$E(X_t'\Omega_tX_t) = \frac{1}{N}\sum_{i=1}^{N} E(x_{it}x_{it}'u_{it}^2) + \frac{2}{N}\sum_{i=1}^{N-1}\sum_{j=i+1}^{N} E(u_{it}u_{jt}x_{it}x_{jt}')$$

Notice that when $N \to \infty$, if we do not have weak dependence in the cross section, $W$ will not be finite: the term $\lim_{N\to\infty} \frac{2}{N}\sum_{i=1}^{N-1}\sum_{j=i+1}^{N} E(u_{it}u_{jt}x_{it}x_{jt}')$ diverges, and we would not be able to define an asymptotic distribution for $\hat\beta$ unless we add some assumptions about the spatial dependence or change the formula for the asymptotic variance. Therefore, if we do not assume any mixing condition on the dependence, we have to divide $W$ by $N$ in order to obtain a finite variance-covariance matrix. The case when $T \to \infty$ and there is no mixing in the time series is exactly analogous: the term $\lim_{T\to\infty} \frac{2}{T}\sum_{t=1}^{T-1}\sum_{s=t+1}^{T} E(u_{it}u_{is}x_{it}x_{is}')$ diverges and we need to divide $W$ by $T$.

Before analyzing estimators of $W$, which is the main purpose of this section, I discuss the properties of the OLS estimator of $\beta$ under different assumptions about the error dependence and different situations regarding the sample sizes of the cross section and the time series.

Remark 1: When $\{N, T\} \to \infty$ and there is mixing in both the spatial and the time dependence, the speed of convergence of the OLS estimator to its asymptotic distribution is $d_{NT} = \sqrt{NT}$.

Remark 2: When $\{N, T\} \to \infty$ and there is mixing in the cross section but not in the time series, the OLS estimator is $\sqrt{N}$-consistent rather than $\sqrt{NT}$-consistent. Intuitively, this slower rate of convergence arises because the time series data are less informative: since we are allowing for a general form of time dependence, we are learning only from the cross section.

Remark 3: When $\{N, T\} \to \infty$ and there is mixing in the time series but not in the cross section, the OLS estimator is $\sqrt{T}$-consistent rather than $\sqrt{NT}$-consistent. Intuitively, this slower rate of convergence arises because the cross-sectional data are less informative: since we are allowing for a general form of spatial dependence, we are learning only from the time series.

In the following subsections, I analyze the properties of different estimators of $W$ and $V$ when we have both spatial and time dependence.
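The three-part decomposition of $W$ derived in this section rests on a counting identity: summing $z_{it}z_{js}$ (with $z_{it} = x_{it}u_{it}$, scalar $x_{it}$) over all pairs of observations that share an individual or a period equals the by-individual part plus the by-period part minus the own-variance part, which would otherwise be counted twice. The Python sketch below checks the sample version of that identity by brute force. The array `z` is simulated and all variable names are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(1)
N, T = 6, 5
z = rng.normal(size=(N, T))        # z[i, t] stands in for x_it * u_it

# Brute force: every pair of observations sharing a unit or a period,
# which are the only pairs with nonzero covariance under assumption 2.
brute = 0.0
for i in range(N):
    for t in range(T):
        for j in range(N):
            for s in range(T):
                if i == j or t == s:
                    brute += z[i, t] * z[j, s]

by_unit = (z.sum(axis=1) ** 2).sum()   # time dependence within each individual
by_time = (z.sum(axis=0) ** 2).sum()   # spatial dependence within each period
own = (z ** 2).sum()                   # own terms, present in both sums above

three_part = by_unit + by_time - own
print(np.isclose(brute, three_part))
```

Replacing `z` with products of regressors and OLS residuals and dividing by $NT$ gives the sample analogue of the three terms of $W$.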

3 Two-way cluster estimator

The purpose of this section is to study the properties of the two-way cluster estimator of $V$ as a robust way of taking into account the spatial and time dependence in the error term, and to give some insight into its assumptions. For that purpose I extend the results in Hansen (2007), who analyzed the properties of the one-way cluster estimator in a panel with only time dependence in the errors, to the two-way cluster estimator in a framework with spatial and time dependence.

In order to obtain a consistent estimator of $V$, we need a consistent estimator of $W$. Once we have $\hat W$ we are done:

$$\hat V = \left(\frac{1}{NT}\sum_{i}^{N}\sum_{t}^{T} x_{it}x_{it}'\right)^{-1} \hat W \left(\frac{1}{NT}\sum_{i}^{N}\sum_{t}^{T} x_{it}x_{it}'\right)^{-1}$$

The idea is to get a consistent estimator of

$$W = \frac{1}{NT}\sum_{i=1}^{N} E(X_i'U_iU_i'X_i) + \frac{1}{NT}\sum_{t}^{T} E(X_t'U_tU_t'X_t) - \frac{1}{NT}\sum_{i}^{N}\sum_{t}^{T} E(x_{it}x_{it}'u_{it}^2)$$

Therefore, we need to consistently estimate each of the three parts of $W$. The two-way clustering estimator proposed by Cameron et al. (2011) extends the one-way cluster estimator of Arellano (1987) to a setting with correlation in two dimensions. In fact, Cameron et al. (2011) proposed to estimate each of the first two parts of $W$ by a one-way cluster estimator. The first part of $W$ is estimated by grouping the data by individual clusters and the second part is estimated by grouping the data by time clusters. Notice that the third part of $W$ is the overall mean of the individual variances and can be consistently estimated using the White formula:

$$\frac{1}{NT}\sum_{i}^{N}\sum_{t}^{T} x_{it}x_{it}'\hat u_{it}^2$$

where $\hat u_{it}$ are the OLS residuals of (2.2).

Let me discuss the intuition behind this estimator. The first part of $W$ contains the time correlation of each individual: $\frac{1}{N}\sum_{i=1}^{N} E(X_i'\Omega_iX_i)$, where $\Omega_i$ is a $T \times T$ matrix that contains all the time dependence of individual $i$: $\Omega_i = \frac{1}{T}E[U_iU_i' \mid X_i]$. In the most general case of time dependence, each $\Omega_i$ contains $T(T+1)/2$ different elements. Nevertheless, it is important to remark that we do not need to consistently estimate each of the $\Omega_i$,

we only need to consistently estimate an average over these individual clusters: $\frac{1}{N}\sum_{i=1}^{N} E(X_i'U_iU_i'X_i)$. This is the same problem as when we work with heteroskedastic errors. Therefore we can extend the robust variance estimator proposed by White to the data grouped by individual:

$$\frac{1}{NT}\sum_{i=1}^{N} X_i'\hat U_i\hat U_i'X_i$$

In fact, this is the one-way cluster estimator for panel data proposed by Arellano (1987) and studied by Hansen (2007).

In order to understand the conditions for consistency of this estimator we can express $X_i'U_iU_i'X_i$ as:

$$X_i'U_iU_i'X_i = E(X_i'U_iU_i'X_i) + \epsilon_i$$

where $E[\epsilon_i] = 0$. Then

$$\frac{1}{N}\sum_{i=1}^{N} X_i'U_iU_i'X_i = \frac{1}{N}\sum_{i=1}^{N} E(X_i'U_iU_i'X_i) + \frac{1}{N}\sum_{i=1}^{N}\epsilon_i$$

Therefore $\frac{1}{N}\sum_{i=1}^{N} X_i'U_iU_i'X_i$ will be a consistent estimator of $\frac{1}{N}\sum_{i=1}^{N} E(X_i'U_iU_i'X_i)$ as long as $\frac{1}{N}\sum_{i=1}^{N}\epsilon_i \xrightarrow{p} E[\epsilon_i]$. This will happen only when $N \to \infty$. Given assumption 1, we can replace the unobserved vector $U_i$ by the consistent estimator $\hat U_i$ in the one-way cluster formula.

Remark 4: The estimator $\frac{1}{NT}\sum_{i=1}^{N} X_i'\hat U_i\hat U_i'X_i$ is a function of the weighted average of all the individual-cluster variances; therefore, in order to derive the asymptotics we need the number of clusters $N$ to go to infinity.

Remark 5: When $\{N, T\} \to \infty$, the speed of convergence of $\frac{1}{NT}\sum_{i=1}^{N} X_i'\hat U_i\hat U_i'X_i$ to its asymptotic distribution is $\sqrt{N}$ (see Theorems 2 and 3 of Hansen (2007)).

The second part of $W$ contains the spatial correlation of the error in each moment in time: $\frac{1}{T}\sum_{t=1}^{T} E(X_t'\Omega_tX_t)$, where $\Omega_t$ is an $N \times N$ matrix that contains all the spatial dependence in period $t$.
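The individual-cluster term $\sum_i X_i'\hat U_i\hat U_i'X_i$ and the resulting sandwich variance can be sketched as follows. This is a minimal illustration rather than a reference implementation: the function name `cluster_by_unit_variance` and the simulated inputs are assumptions, and the function returns the finite-sample variance of $\hat\beta$, that is $\hat V/(NT)$ with only the first part of $\hat W$ included.

```python
import numpy as np

def cluster_by_unit_variance(x, resid):
    """One-way (by individual) cluster-robust variance of the OLS estimator.

    x:     (N, T, k) array of (demeaned) regressors
    resid: (N, T) array of OLS residuals
    """
    scores = np.einsum("itk,it->ik", x, resid)   # row i holds X_i' u_i
    meat = scores.T @ scores                     # sum_i X_i' u_i u_i' X_i
    bread = np.linalg.inv(np.einsum("itk,itl->kl", x, x))  # (sum x x')^{-1}
    return bread @ meat @ bread

# Illustrative call on simulated inputs (assumed data).
rng = np.random.default_rng(0)
N, T, k = 50, 30, 2
x = rng.normal(size=(N, T, k))
resid = rng.normal(size=(N, T))
V_unit = cluster_by_unit_variance(x, resid)
std_err = np.sqrt(np.diag(V_unit))
```

Swapping the roles of the two panel dimensions (passing `x.transpose(1, 0, 2)` and `resid.T`) gives the time-cluster counterpart that handles the spatial dependence.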

As in the previous case, we do not need a consistent estimator of each $\Omega_t$; we need a consistent estimator of $\frac{1}{T}\sum_{t=1}^{T} E(X_t'U_tU_t'X_t)$. Therefore we can group the data by time clusters and compute the following expression:

$$\frac{1}{T}\sum_{t}^{T} X_t'\hat U_t\hat U_t'X_t$$

The last expression will be a consistent estimator of $\frac{1}{T}\sum_{t}^{T} E(X_t'U_tU_t'X_t)$ as long as $\frac{1}{T}\sum_{t=1}^{T}\epsilon_t \xrightarrow{p} E[\epsilon_t]$. This will happen only when $T \to \infty$.

Remark 6: The estimator $\frac{1}{NT}\sum_{t}^{T} X_t'\hat U_t\hat U_t'X_t$ is a function of the weighted average of all the time-cluster variances; therefore, in order to derive the asymptotics we need the number of clusters $T$ to go to infinity.

Remark 7: When $\{N, T\} \to \infty$, the speed of convergence of $\frac{1}{NT}\sum_{t}^{T} X_t'\hat U_t\hat U_t'X_t$ to its asymptotic distribution is $\sqrt{T}$.

The two-way cluster estimator is the following:

$$\hat W_{2WCLUSTER} = \frac{1}{NT}\sum_{i=1}^{N} X_i'\hat U_i\hat U_i'X_i + \frac{1}{NT}\sum_{t}^{T} X_t'\hat U_t\hat U_t'X_t - \frac{1}{NT}\sum_{i}^{N}\sum_{t}^{T} x_{it}x_{it}'\hat u_{it}^2$$

where $\hat U_i$ is a $T \times 1$ vector that groups the $T$ residuals for individual $i$ and $\hat U_t$ is an $N \times 1$ vector that groups the $N$ residuals in period $t$.

Theorem: Under weak dependence and suitable regularity conditions in the time series and the cross section, if A1 and A2 are satisfied:

1. The speed of convergence of $\hat W_{2WCLUSTER}$ is $\min(\sqrt{N}, \sqrt{T})$.

2. For $\Omega_t = X_t'U_tU_t'X_t$:

$$\sqrt{T}\left[vec\left(\hat W_{Cluster,t} - \frac{1}{NT}\sum_{t=1}^{T} E(\Omega_t)\right)\right] \xrightarrow{d}$$

$$N\left(0,\ \lim \frac{1}{N^2T}\left(\sum_{t=1}^{T} Var(\Omega_t) + 2\sum_{t=1}^{T-1}\sum_{s=t+1}^{T} Cov(\Omega_t, \Omega_s)\right)\right)$$

3. For $\Omega_i = X_i'U_iU_i'X_i$:

$$\sqrt{N}\left[vec\left(\hat W_{Cluster,i} - \frac{1}{NT}\sum_{i=1}^{N} E(\Omega_i)\right)\right] \xrightarrow{d} N\left(0,\ \lim \frac{1}{NT^2}\left(\sum_{i=1}^{N} Var(\Omega_i) + 2\sum_{i=1}^{N-1}\sum_{j=i+1}^{N} Cov(\Omega_i, \Omega_j)\right)\right)$$

Remark 8: The two-way cluster estimator needs both $N$ and $T$ to go to infinity, because the first and the second parts of $\hat W_{2WCLUSTER}$ are consistent when $N$ goes to infinity and when $T$ goes to infinity, respectively.

Remark 9a: $\frac{1}{NT}\sum_{t=1}^{T} E(\Omega_t)$ is not well estimated by $\hat W_{Cluster,t}$ when the time series presents dependence and $T$ is not very large.

Remark 9b: $\frac{1}{NT}\sum_{i=1}^{N} E(\Omega_i)$ is not well estimated by $\hat W_{Cluster,i}$ when the cross section presents dependence and $N$ is not large.

The speed of convergence of the two-way cluster estimator is $\min(\sqrt{N}, \sqrt{T})$. More importantly, the variance of the cluster estimator (the variance of the variance estimator) is also affected by the two forms of dependence. As a consequence, when $T$ and $N$ are not large enough, inference based on the two-way cluster estimator may be misleading.

It is true that the two-way clustering approach is the ideal way of estimating the variance, because it does not impose any restriction on the form of the dependence. Nevertheless, this freedom comes with the requirement of a large number of observations in both the cross section and the time series. In settings where $N$ and $T$ are not large enough, as in state-year or country-year panels, the gain from imposing some structure on the error dependence could be important. For example, a mixing assumption gives us the possibility of considering alternative estimators of the variance, such as the double Heteroskedasticity and Autocorrelation Consistent (HAC) variance estimator. With a mixing assumption in the time series, we can use the HAC estimator of Newey and West to take into account the time dependence, and with a mixing assumption in the cross section we can estimate a spatial HAC

version as in Conley (1999). The double HAC estimator should have better small sample properties than the two-way cluster estimator. In a recent paper, Bester et al. (2011) proposed a test based on the cluster estimator when the number of clusters is small. Rather than using the normal approximation, they derive a limiting distribution for the t statistic, treating the cluster estimator as a random variable. Nevertheless, this approach requires the number of elements in each cluster to be large relative to the number of clusters.

A third alternative, which could be more convenient when $N$ and $T$ are of the same size and are not large enough, is to impose more structure by modeling the dependence in a parametric way. The way we should model the dependence will depend on the specific application. In particular, Hansen (2007) remarks that the bias of simple parametric estimators is also typically smaller in the case where the parametric model is correct, making this approach preferable when the researcher is confident about the form of the error process.

4 Parametric Approach

4.1 The SAR model

An alternative approach to deal with the error dependence is to model it in a parametric way, i.e. imposing some structure on the variance-covariance matrix. In a small sample framework, this methodology behaves better than clustering, since the variance estimator [5] will be $\sqrt{NT}$-consistent. ARMA models are well known and widely used to model time dependence; however, models for spatial dependence are not so well known, because cross-sectional data do not have a natural order, as opposed to time series data. Notwithstanding, the interest and developments in spatial econometrics have been increasing in the last 10 years as a consequence of the increasing use of cross-sectional data and panels with large $N$ in applied econometrics.

[5] If the model is stationary.

In this way, models for specifying, estimating and testing spatial dependence have been developed in leading and pioneering works such as Anselin (1988, 1999), Baltagi et al. (2003),

Baltagi et al. (2010), Kelejian and Prucha (1998, 1999), Lee and Yu (2007, 2008), etc. All of these papers are based on the spatial autoregressive (SAR) model proposed by Cliff and Ord (1973, 1981), whose specification is similar to the time series autoregressive model. However, there are several important differences between them. In this section I will only comment on three of the most important ones.

The spatial autoregressive model of order one, SAR(1), for the error process $u_{it}$ of the panel model is the following:

$$u_{it} = \lambda\sum_{j=1}^{N} w_{ij}u_{jt} + \varepsilon_{it}$$

Or in matrix form:

$$U(t) = \lambda W_N U(t) + \epsilon(t) \qquad (4.1)$$

where $U(t)$ is an $N \times 1$ vector which contains the shocks $u_{it}$ for the $N$ observations in period $t$, $\lambda$ is the spatial parameter, $\epsilon(t)$ is an $N \times 1$ vector of innovations and $W_N$ is an $N \times N$ matrix which is determined by the researcher before the estimation. Each element $w_{ij}$ allows for direct correlation between two different cross-sectional units.

In the first place, because spatial units do not have an established order, there is no direct counterpart of the time series lag in the spatial domain. So, maintaining the idea that nearby realizations of a stochastic process are related, a spatial lag operator $w_{ij}$ is used to allow for correlation between units that are near in distance. In this way a spatial unit is a function of a weighted average of other realizations of the same stochastic process at neighboring locations. So the weighting operator $w_{ij}$, or the weighting matrix $W_N$ which contains all the weights, plays an important role in modeling spatial dependence. This weighting matrix has the following features:

(i) Each $w_{ij}$ is nonzero when unit $i$ and unit $j$ are not far apart, but tends to zero as the distance between the units increases. This assumption is crucial to have a finite variance-covariance matrix for the asymptotic distribution of the OLS estimator when $N$ goes to infinity, like an ergodicity property for a time series process.

(ii) $W_N$ is a nonstochastic matrix that has to be defined before the estimation, and all the elements of its diagonal are equal to zero: $w_{ij}$ is equal to zero if $i = j$. The first assumption is made to avoid identification problems (it is impossible to estimate the $N \times N$ elements of $W_N$ in a cross-sectional setting). The second assumption is a normalization of the model and does not have implications for the estimation.

(iii) The matrix $I_N - \lambda W_N$ is nonsingular for all values of $\lambda$, which ensures that the process $U$ is invertible and can be expressed as a weighted average of the innovations $\epsilon(t)$.

(iv) In most cases the matrix $W_N$ is row standardized in such a way that the sum of each row (column) is equal to one. This is only for ease of interpretation.

In the second place, unlike time series models, the fact that $\lambda$ is less than one in absolute value does not imply covariance stationarity:

$$Var(U) = \sigma^2\left[(I_N - \lambda W_N)'(I_N - \lambda W_N)\right]^{-1} \qquad (4.2)$$

The elements of the diagonal of $Var(U)$ in (4.2) are not necessarily equal and depend on the structure of $W_N$. Moreover, a $\lambda$ less than one in absolute value is required to ensure the ergodicity of the process, which implies a finite variance of $U$, $\hat\lambda$ and $\hat\beta$.

The third and most important difference between the SAR(1) model in (4.1) and the AR(1) model is that the OLS estimator of the autoregressive coefficient in a spatial model is inconsistent. Unlike time series models, the lag term in a spatial model is an endogenous variable which is correlated with

the innovations in $\epsilon(t)$, regardless of whether the innovations are iid. We can see this in the reduced form $U(t) = (I_N - \lambda W_N)^{-1}\epsilon(t)$, where each shock in the vector $U(t)$ is a combination of all the innovations in the vector $\epsilon(t)$. Therefore, the endogeneity and the bias of OLS will depend on the structure of $W_N$ and the value of the spatial autoregressive parameter $\lambda$. An alternative way to understand this problem is to think of a spatial model as a system of $N$ equations, so OLS will be biased and inconsistent due to simultaneity bias. As a consequence, the spatial lag variable has to be treated as an endogenous variable and the estimation method used for the spatial lag parameter has to exploit valid moment conditions.

One way to estimate the spatial autoregressive coefficient is via pseudo maximum likelihood, assuming that $\epsilon(t)$ is normally distributed. In this case we can treat $U(t)$ as a multivariate random process which has an unconditional normal distribution $N\left(0, \sigma^2\left[(I_N - \lambda W_N)'(I_N - \lambda W_N)\right]^{-1}\right)$:

$$\ln L = -\frac{N}{2}\left(\ln(2\pi) + \ln(\sigma^2)\right) + \ln\left|I_N - \lambda W_N\right| - \frac{1}{2\sigma^2}\, U'(I_N - \lambda W_N)'(I_N - \lambda W_N)U$$

4.2 Cluster estimation as a flexible model of spatial dependence

Let us define $z_{it} = x_{it}u_{it}$ for a scalar $x_{it}$, for the sake of simplicity, and the $N \times 1$ vector $Z_t = [z_{1t}, z_{2t}, \dots, z_{Nt}]'$. Then

$$\hat W_{Cluster,t} = \frac{1}{NT}\sum_{t=1}^{T}\left(\sum_{i=1}^{N} z_{it}\right)^2 = \frac{1}{N}\sum_{i=1}^{N}\sum_{j=1}^{N}\left(\frac{1}{T}\sum_{t=1}^{T} z_{it}z_{jt}\right)$$

which collects the terms $\frac{1}{T}\sum_t z_{1t}^2$, $\frac{1}{T}\sum_t z_{1t}z_{Nt}$, ..., $\frac{1}{T}\sum_t z_{N-1,t}z_{Nt}$, $\frac{1}{T}\sum_t z_{Nt}^2$. The cluster estimator robust to spatial dependence can be seen as the sum of the sample analogue estimators of the $N(N+1)/2$ spatial covariances:

$$E(z_{1t}^2) = \rho_{1,1},\quad E(z_{1t}z_{2t}) = \rho_{1,2},\quad \dots,\quad E(z_{1t}z_{Nt}) = \rho_{1,N},\quad \dots,\quad E(z_{Nt}^2) = \rho_{N,N}$$

In the cluster approach we are using the time series to estimate each of the previous moment conditions. (Remember that the speed of convergence of the cluster estimator robust to spatial dependence is $\sqrt{T}$.) Write

$$Z_t = \Gamma^{1/2}\epsilon_t$$

where $\Gamma$ is an $N \times N$ symmetric matrix that contains the $N(N+1)/2$ spatial correlations and $\epsilon_t$ is an $N \times 1$ vector of innovations with $E[\epsilon_t] = 0$ and $E[\epsilon_t\epsilon_t'] = I_N$. We can express the model as:

$$Z_t = \Lambda Z_t + \epsilon_t$$

where $\Lambda = I_N - \Gamma^{-1/2}$ is an $N \times N$ matrix. The cluster estimator can be expressed in terms of a particular fully flexible spatial autoregressive model of order $N(N+1)/2$:

$$Z_t = \sum_{i=1}^{N-1}\sum_{j=i+1}^{N}\lambda_{ij}W_{ij}^{C}Z_t + \epsilon_t$$

where $\lambda_{ij}$ is the $ij$ element of $\Lambda$ and each $W_{ij}^{C}$ is an $N \times N$ matrix in which the element $ij$ is equal to one and the other elements are zero.

We can also capture the unrestricted dependence implied by the cluster model with other sequences of weighting matrices different from $W_{ij}^{C}$. We can define the following SAR(K) model:

$$Z_t = \sum_{k=1}^{K}\lambda_kW_kZ_t + \epsilon_t$$

where $K$ is a function of $N$ and $\{W_1, W_2, \dots, W_K\}$ is a sequence of weighting matrices. We can mimic the cluster model and capture all the dependence implied by the matrices $W_{ij}^{C}$ by saturating the SAR(K) with $K = N(N+1)/2$ and a suitable selection of $\{W_1, W_2, \dots, W_K\}$. If $K < N(N+1)/2$ but grows with $N$ as $N \to \infty$, this model will be asymptotically equivalent to the cluster model (badly estimated for $T$ not very large). If $K$ is fixed, the estimator of the variance will learn from both the time series and the cross section (the variance estimator will be $\sqrt{NT}$-consistent).

The SAR model of order one can be seen as a special case of the cluster estimator ($K = 1$):

$$Z_t = \lambda WZ_t + \epsilon_t$$

In terms of the cluster model, this implies that the elements of $\Lambda$ for states that do not share a border are zero and the elements of $\Lambda$ for states that share a border are all equal. Note that all the elements of $Z_t$ could still be correlated (states that share a border interact directly, while states that do not share a border could be connected indirectly through other common states).

Suppose we have a spatial model for the error term instead of a spatial model for $z_{it}$:

$$U_t = \lambda WU_t + \xi_t$$

where $U_t$ is the $N \times 1$ vector of errors and $\xi_t$ is an $N \times 1$ vector of innovations with $E(\xi_t) = 0$ and $E(\xi_t\xi_t') = \sigma_\xi I_N$. Then

$$Z_t'Z_t = X_t'U_tU_t'X_t, \qquad E(Z_t'Z_t \mid X_t) = \sigma_\xi X_t'\left[(I - \lambda W)'(I - \lambda W)\right]^{-1}X_t$$

This implies a more restrictive model for $Z_t$ because we are assuming homoskedasticity: $E(U_tU_t' \mid X_t) = E(U_tU_t')$.
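To make the SAR(1) special case concrete, the sketch below builds a row-normalized weighting matrix, computes the implied error covariance $\sigma_\xi[(I - \lambda W)'(I - \lambda W)]^{-1}$ and plugs it into $X_t'[\cdot]X_t$. Everything specific here is an assumption for illustration: the five-unit "line" contiguity pattern, the value $\lambda = 0.5$, the simulated regressors and the helper names `row_normalize` and `sar1_covariance` are hypothetical and do not come from the paper.

```python
import numpy as np

def row_normalize(adj):
    """Scale each row of a contiguity matrix so that it sums to one."""
    return adj / adj.sum(axis=1, keepdims=True)

def sar1_covariance(W, lam, sigma2=1.0):
    """Covariance of U_t implied by U_t = lam * W U_t + xi_t, Var(xi_t) = sigma2 * I."""
    n = W.shape[0]
    B = np.eye(n) - lam * W
    return sigma2 * np.linalg.inv(B.T @ B)

# Hypothetical contiguity: five units on a line, neighbours share a border.
N = 5
adj = np.zeros((N, N))
idx = np.arange(N - 1)
adj[idx, idx + 1] = 1.0
adj[idx + 1, idx] = 1.0
W = row_normalize(adj)

omega = sar1_covariance(W, lam=0.5)

# Parametric counterpart of X_t' U_t U_t' X_t for one period (simulated X_t).
rng = np.random.default_rng(2)
X_t = rng.normal(size=(N, 2))
meat_t = X_t.T @ omega @ X_t

print(np.round(np.diag(omega), 3))   # diagonal elements differ across units
```

Only $\lambda$ and $\sigma_\xi$ have to be estimated in this version, whereas the time-cluster term implicitly estimates all $N(N+1)/2$ spatial covariances from $T$ periods. The printed diagonal also illustrates that the implied variances are not equal across units even though the innovations are homoskedastic.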

4.3 A Parametric Model for Spatial and Time Dependence

In this subsection I present a model which allows for all kinds of dependence in a panel data model. This model is a generalization of the SAR model described above, with two additional regressors that control for the time dependence and for the spatial dependence at different moments in time:

\[ U_t = \lambda (I_T \otimes W_N) U_t + \rho_1 U_{t-1} + \rho_2 (I_T \otimes W_N) U_{t-1} + \epsilon_t \quad (4.3) \]

where $U_t$ is an $NT \times 1$ vector which contains each $u_{it}$ for the $N$ states and $T$ periods, and $U_{t-1}$ is the same vector lagged one period. In this model, the presence of the term $\lambda (I_T \otimes W_N) U_t$ in (4.3) generates, as in the SAR model, an endogeneity problem, so the OLS estimators remain inconsistent. To estimate the unknown parameters of this model, $\theta = [\lambda, \rho_1, \rho_2, \sigma^2]$, we can use a conditional pseudo maximum likelihood approach, as if it were the reduced form of a VAR model.

4.3.1 Quasi Maximum Likelihood

Defining $B_N = I_N - \lambda W_N$, $A = \rho_1 I_N + \rho_2 W_N$ and $A_1 = B_N^{-1} A$, and assuming that $B_N$ is invertible, we can express (4.3) in reduced form:

\[ U_t = (I_T \otimes A_1) U_{t-1} + (I_T \otimes B_N^{-1}) \epsilon_t \quad (4.4) \]

We know the first two moments of the conditional distribution of $U_t \mid U_{t-1}$:

\[ E[U_t \mid U_{t-1}] = (I_T \otimes A_1) U_{t-1} \]

\[ Var[U_t \mid U_{t-1}] = \sigma^2 \left( I_T \otimes (B_N' B_N)^{-1} \right) \]

Thus, in this case the conditional expectation of the score is zero, and the consistency and asymptotic normality of the estimator are satisfied, as shown in Yu, Jong and Lee (2007). The conditional log-likelihood is

\[ \ln L(U(2), \dots, U(T); \theta) = -\frac{NT}{2} \left( \ln(2\pi) + \ln(\sigma^2) \right) + (T-1) \ln |B_N| - \frac{1}{2\sigma^2} \left[ U_t - (I_{T-1} \otimes A_1) U_{t-1} \right]' \left( I_{T-1} \otimes (B_N' B_N)^{-1} \right)^{-1} \left[ U_t - (I_{T-1} \otimes A_1) U_{t-1} \right] \]

where $U_t$ is an $N(T-1) \times 1$ vector and $I_{T-1}$ is an identity matrix.

4.3.2 Non-stationary case

One particularity of this spatial-time model is that, even when the autoregressive coefficient $\rho_1$ is less than one in absolute value, the vector $U(t)$ could contain unit roots, which implies that there may be non-stationary components in the data generating process. A non-stationary case occurs when some eigenvalue of $A_1$ is equal to one. As in Yu, Jong and Lee (2007), we can write the $i$-th eigenvalue of $A_1$ as

\[ \psi_{iN} = \frac{\rho_1 + \rho_2 \varpi_{iN}}{1 - \lambda \varpi_{iN}} \]

where $\varpi_{iN}$ is the $i$-th eigenvalue of the matrix $W_N$. Moreover, if $W_N is row normalized from a symmetric matrix, all its eigenvalues are less than or equal to one in absolute value and the largest is equal to one (see Ord (1975)), so we could have unit roots in $U(t)$ if $\lambda + \rho_1 + \rho_2 = 1$. Nevertheless, if the spatial weight matrix $W_N$ is row normalized from a symmetric matrix, we can still obtain the consistency and asymptotic normality of the ML and QML estimators with the same rate of convergence as in the stationary case, as demonstrated in Yu, Jong and Lee (2007), although with a different variance covariance matrix.

4.3.3 Monte Carlo Simulations: OLS vs QML

In this subsection I simulate a spatial process $U(t) = \lambda W_N U(t) + \epsilon(t)$ with $N = 50$ for different values of the spatial parameter $\lambda$. The purpose of the simulation is to evaluate the performance of OLS

relative to the conditional QML for a spatial model. Then I extend the simulation to a spatial-time model as in (4.3) for two different values of the parameters in $\theta$. I generate one thousand replications of a model with sample size $N = 50$ and $T = 31$, where the $\epsilon(t)$ are generated from independent standard normal distributions. The initial value $U(1)$ is generated as $N(0, I_N)$. For each set of generated sample observations I calculate the simulated sample mean and the simulated sample variance of the OLS and the conditional QML estimators. The sample mean is constructed as

\[ \bar{\theta} = \frac{1}{1000} \sum_{s=1}^{1000} \hat{\theta}_s \]

and the sample variance as

\[ \frac{1}{1000} \sum_{s=1}^{1000} \left( \hat{\theta}_s - \frac{1}{1000} \sum_{s=1}^{1000} \hat{\theta}_s \right)^2 \]

The results are summarized in Graphic N1, Graphic N2 and Table N1.
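Before simulating the spatial-time model for a given $\theta$, it can be useful to check whether that parameter choice implies a unit root, as discussed in subsection 4.3.2. The following sketch assumes a hypothetical four-state contiguity matrix and illustrative parameter values chosen so that $\lambda + \rho_1 + \rho_2 = 1$; it compares the eigenvalues of $A_1 = B_N^{-1}(\rho_1 I_N + \rho_2 W_N)$ computed directly with the closed form $(\rho_1 + \rho_2 \varpi_{iN})/(1 - \lambda \varpi_{iN})$.

```python
import numpy as np

# Hypothetical map: 4 states in a row (1-2-3-4), row-normalized contiguity.
adj = np.array([[0, 1, 0, 0],
                [1, 0, 1, 0],
                [0, 1, 0, 1],
                [0, 0, 1, 0]], dtype=float)
W = adj / adj.sum(axis=1, keepdims=True)
n = W.shape[0]

# Illustrative parameters on the unit-root boundary lam + rho1 + rho2 = 1.
lam, rho1, rho2 = 0.3, 0.4, 0.3

# Eigenvalues of W (real, because W is row-normalized from a symmetric matrix).
w_eigs = np.sort(np.linalg.eigvals(W).real)

# Eigenvalues of A_1 computed directly ...
A1 = np.linalg.inv(np.eye(n) - lam * W) @ (rho1 * np.eye(n) + rho2 * W)
psi_direct = np.sort(np.linalg.eigvals(A1).real)

# ... and from the closed form in terms of the eigenvalues of W.
psi_formula = np.sort((rho1 + rho2 * w_eigs) / (1.0 - lam * w_eigs))

print("eigenvalues of W:  ", np.round(w_eigs, 4))
print("eigenvalues of A_1:", np.round(psi_direct, 4))
print("largest eigenvalue of A_1:", round(float(psi_direct.max()), 6))
```

With these values the largest eigenvalue of $A_1$ equals one even though $|\rho_1| < 1$, which is the non-stationary case described above.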

Graphic N1: $U(t) = \lambda W_N U(t) + \epsilon(t)$

The black line in the graph above shows the true values of the spatial parameter, while the blue line shows the average of the OLS estimator over a thousand replications. The red dashed lines are the confidence intervals, calculated as two standard deviations around the mean. From the graph we can see that the bias of the OLS estimator increases as the value of the spatial parameter increases.
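The pure spatial experiment can be sketched as follows. This is a minimal illustration, not the exact design used here: the weight matrix is a hypothetical ring in which each state has two neighbours (the actual $W_N$ is a state contiguity matrix), the QML estimator is obtained by a grid search over the concentrated Gaussian log-likelihood, and the function and variable names are made up for the example.

```python
import numpy as np

def ring_weights(n):
    """Row-normalized ring contiguity matrix (hypothetical stand-in for W_N)."""
    W = np.zeros((n, n))
    idx = np.arange(n)
    W[idx, (idx + 1) % n] = 0.5
    W[idx, (idx - 1) % n] = 0.5
    return W

def simulate_ols_vs_qml(lam, n=50, t=31, reps=200, seed=0):
    """Mean and variance of OLS and QML estimates of lam across replications."""
    rng = np.random.default_rng(seed)
    W = ring_weights(n)
    B_inv = np.linalg.inv(np.eye(n) - lam * W)
    w_eigs = np.linalg.eigvalsh(W)                 # this ring W is symmetric
    grid = np.linspace(-0.95, 0.95, 381)           # candidate values of lam
    log_det = np.array([np.sum(np.log(1.0 - g * w_eigs)) for g in grid])
    ols, qml = np.empty(reps), np.empty(reps)
    for r in range(reps):
        U = B_inv @ rng.standard_normal((n, t))    # U(t) = (I - lam W)^{-1} eps(t)
        WU = W @ U
        s_uu, s_uw, s_ww = np.sum(U * U), np.sum(U * WU), np.sum(WU * WU)
        ols[r] = s_uw / s_ww                       # OLS of U(t) on W U(t)
        # ||(I - g W) U||^2 / (n t) for every g on the grid
        sigma2 = (s_uu - 2.0 * grid * s_uw + grid**2 * s_ww) / (n * t)
        loglik = t * log_det - 0.5 * n * t * np.log(sigma2)
        qml[r] = grid[np.argmax(loglik)]
    return ols.mean(), ols.var(), qml.mean(), qml.var()

m_ols, v_ols, m_qml, v_qml = simulate_ols_vs_qml(0.4, reps=100)
print(f"true lam = 0.4, mean OLS = {m_ols:.3f}, mean QML = {m_qml:.3f}")
```

The size of the OLS bias depends on the weight matrix, so the numbers from this ring example should not be compared directly with those in the graphs.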

Graphic N2: $U(t) = \lambda W_N U(t) + \epsilon(t)$

On the other hand, the mean of the conditional QML estimator is very close to the actual values of the spatial parameter. In addition, the actual values of the spatial parameter are within the confidence interval of the estimator.

Table N1: $U_t = \lambda (I_T \otimes W_N) U_t + \rho_1 U_{t-1} + \rho_2 (I_T \otimes W_N) U_{t-1} + \epsilon_t$

From Table N1 we can observe that the mean of the OLS estimator is far from the true value of the parameter representing the spatial dependence. The OLS estimator has a bias of 0.11 (two standard deviations) in the spatial parameter, compared with a bias of 0.02 for the conditional QML (one standard deviation). This difference in the biases increases when the true value of the spatial parameter is higher: when the true value of the spatial parameter is 0.6 rather than 0.4, the bias of the OLS estimator is 0.21 (7 standard deviations) against 0.02 for the QML estimator (2 standard deviations). The average of the OLS estimator of the parameter that represents the time dependence is very close to the true parameter, as is the conditional QML estimator, since the time lag does not suffer from endogeneity. Finally, the OLS estimator of the parameter that measures the dependence between states at different moments in time is also biased, because the variable $W_N U_{t-1}$ is strongly correlated with the variable $W_N U_t$, which is endogenous. Thus, the higher the endogeneity of $W_N U_t$ (which depends on the structure of $W_N$ and the value of $\lambda$), the higher the bias of the OLS estimator of the parameter on $W_N U_{t-1}$, as shown in the table.

5 Monte Carlo Simulation

The purpose of this section is twofold. The first is to study the behavior of the two-way clustering approach and the parametric approach in a setup where we have spatial and time dependence in the error term of the model and $NT$ is large but $N$ and $T$ separately are not. The second is to study and model the error dependence in a specific application and analyze how the results and conclusions could change once we control for several forms of dependence. To tackle these two purposes, I focus on a recent paper that studies the causal effect of the minimum wage on US wage inequality, written by Autor, Manning and Smith (from now on, AMS). This research uses a state panel with $N = 50$ and $T = 30$. By using OLS and 2SLS estimators, with fixed and time effects as controls, the authors conclude that the impact of the minimum wage on wage inequality in the sample period is highly significant. I argue that their inferences are not so reliable, since they do not control for error dependence in a setting where it is very likely. In fact, Barrios et al (2010) show that yearly earnings have substantial correlation across states which

decreases when distance increases. This dependence could be explained by geographical or local labor market features.

In subsection 5.1, I briefly explain the model in AMS. In subsection 5.2, I discuss how to estimate a spatial and time autoregressive model for the error term of the AMS model, I propose a nonlinear least squares estimation that is consistent and computationally efficient in the presence of fixed and time effects in medium-sized panels, and I estimate the parameters of the spatial and time autoregressive model for the error term of AMS. In subsection 5.3, I generate replications of data with the same characteristics as the AMS error model in order to evaluate how often we make type one errors in a regression of a variable with the AMS error structure on the regressors of the AMS model (the effective minimum wage and its square), for different estimators of the variance covariance matrix.

5.1 Autor, Manning and Smith (AMS)

In their paper, AMS use a panel of 30 years and 50 states to measure the impact of the minimum wage on US wage inequality. Their conclusions are based on the estimation of the following regression model:

\[ \log wage_{it}(10) - \log wage_{it}(50) = \left[ \log wage_{it}(10) - \log wage_{it}(50) \right]^{*} + \beta_1 \left[ \log wage^{Min}_{it} - \log wage_{it}(50) \right] + \beta_2 \left[ \log wage^{Min}_{it} - \log wage_{it}(50) \right]^2 + u_{it} \quad (5.1) \]

where $\log wage_{it}(10) - \log wage_{it}(50)$ is the log of the 10th percentile state wage relative to the log of the median state wage (a proxy for wage inequality), $[\log wage_{it}(10) - \log wage_{it}(50)]^{*}$ is the latent inequality, which is approximated by the sum of a fixed effect and a time effect (see Autor, Manning and Smith (2010)), and $[\log wage^{Min}_{it} - \log wage_{it}(50)]$ is a measure of the bindingness of the minimum wage for state $i$ in year $t$, from now on the effective minimum wage. Therefore, the estimated marginal effect of the minimum wage on wage inequality is given by the estimate of

\[ \beta_1 + 2 \beta_2 \frac{1}{NT} \sum_{i=1}^{N} \sum_{t=1}^{T} \left[ \log wage^{Min}_{it} - \log wage_{it}(50) \right] \]

AMS remark that the estimation of

$\beta_1$ and $\beta_2$ in (5.1) could be affected by division bias, because the variable $\log wage_{it}(50)$ appears on both sides of the regression, which induces an artificial positive correlation caused by sampling variation. For this reason, they use the maximum of the federal minimum wage and the state's minimum wage, from now on the statutory minimum, as an instrument. The idea is that the federal minimum wage could directly affect $\log wage^{Min}_{it}$ but not $\log wage_{it}(50)$. It is reasonable to think that the federal minimum wage is an exogenous source of variation that is not directly correlated with the state variation in wage inequality. To instrument the square of the effective minimum, they use the square of the predicted value from a regression of the effective minimum on the statutory minimum, using year and state dummies as controls.

In the AMS model, there may be some shocks that we are not controlling for (apart from the fixed and the time effects). For example, we could think of a productivity shock that affects inequality in each state and in each period. As in Beaudry et al (2010), technological shocks could increase the return to high-skill workers relative to low-skill workers, increasing inequality in a specific state. These productivity shocks could create error dependence in the model in three different dimensions: (i) as in macro models, productivity shocks could follow an autoregressive process, inducing time dependence in the error term; (ii) state-specific productivity shocks could propagate to other states through linkages across states, for example, productivity shocks could quickly propagate between neighboring states; and (iii) productivity shocks could generate spillovers between states at different points in time, i.e. the new technology is transmitted from one state to another over time.

5.2 Estimation of a Parametric Model for Spatial and Time Dependence in the residuals of the AMS model

As a first step I estimated the AMS model in (5.1) by OLS and 2SLS, controlling for time and state effects, in order to obtain the residuals. As in Autor et al (2010), in order to build the dependent variable I followed these steps: (i) First, I grouped all individual responses from the Current Population Survey Merged Outgoing Rotation Group (CPS MORG) for each year. (ii) I used the

reported hourly wage for those who reported being paid by the hour and, if the individual does not have information on the hourly wage, I calculated this variable as weekly earnings divided by hours worked in the prior week. (iii) Then, I limited the sample to individuals who are older than 18 and younger than 64, and excluded self-employed individuals. (iv) Finally, in order to reduce the influence of outliers, I replaced the 98th and 99th percentiles of the wage distribution in each state and year by the 97th percentile value. Using these individual wage data, I computed the percentiles of the state wage distributions by sex for the sample period, weighting individual observations by their CPS sampling weights. For the minimum wage and the federal wage I used the data reported in Table 1 of the Appendix of Autor et al (2010).

The estimated residual of the model is given by the following expression:

\[ \hat{u}_{it} = \log wage_{it}(10) - \log wage_{it}(50) - \hat{\delta}_t - \hat{\delta}_i - \hat{\beta}_1 \left[ \log wage^{Min}_{it} - \log wage_{it}(50) \right] - \hat{\beta}_2 \left[ \log wage^{Min}_{it} - \log wage_{it}(50) \right]^2 \quad (5.2) \]

Using these residuals I estimated a spatial-time model as in (4.3) in order to obtain estimates of the parameters that characterize the error dependence of the AMS model. $W_N$ is a row-normalized matrix obtained from a symmetric matrix that takes the value of one if two states share a border and zero otherwise. The estimates of the parameters in $\theta = [\lambda, \rho_1, \rho_2, \sigma^2]$ for the proposed residual model, obtained both by OLS and by conditional QML (the residuals come from the OLS and 2SLS estimations of the AMS model), are summarized in Table 2 and Table 3.


Table N2

Table N3

As we can see from Table N2 and Table N3, the OLS and conditional QML estimators of the three parameters are highly significant, with a spatial parameter close to 0.2 (for the QML estimator). Moreover, the R-squared of the regressions is around 0.5, and the likelihood ratio test for the three forms of autocorrelation rejects the null hypothesis of absence of autocorrelation.

In this setting $N \times T$ is large, but $N$ and $T$ separately are not, so it is possible that the residuals estimated from the AMS model are biased, since they come from the estimation of the fixed effects and the time effects, which are biased for small $T$ and small $N$ respectively. To avoid the bias caused by the estimation of the fixed and time effects in the residual regression, I eliminate these effects by deviating the variables of the AMS model from their time and state means. That is, I transform the data with $Q_1$ and $Q_2$ in order to eliminate the time and individual effects of the regression, where

\[ Q_1 = I_N \otimes \left( I_T - \frac{1}{T} \iota_T \iota_T' \right) \]

is the matrix which deviates the variables from their time mean, and

\[ Q_2 = \left( I_N - \frac{1}{N} \iota_N \iota_N' \right) \otimes I_T \]

is the matrix which deviates the variables from their state mean. Therefore, I should estimate a spatial-time model for the new transformed residuals:

\[ \tilde{U}_t = \lambda [I_T \otimes W_N] \tilde{U}_t + \rho_1 \tilde{U}_{t-1} + \rho_2 [I_T \otimes W_N] \tilde{U}_{t-1} + \tilde{\epsilon}_t \quad (5.3) \]

where $\tilde{U} = Q_2 Q_1 U$ and $\tilde{\epsilon} = Q_2 Q_1 \epsilon$. However, this transformation creates other sources of bias in the model, because $E[\tilde{\epsilon}_t \mid \tilde{U}_{t-1}]$ is not zero. Therefore, the moments used in the conditional QML for the reduced form of the model are no longer valid. One way to estimate this model is to use the unconditional likelihood of $\tilde{U}$:

\[ \ln L = -\frac{NT}{2} \left( \ln(2\pi) + \ln(\sigma^2) \right) + \ln \left| Q_2 Q_1 \Upsilon Q_1 Q_2 \right|^{-1/2} - \frac{1}{2\sigma^2} \tilde{U}' \left( Q_2 Q_1 \Upsilon Q_1 Q_2 \right)^{-1} \tilde{U} \]

where $\Upsilon$ is the variance covariance matrix of the $NT$ vector $U$:

\[ \Upsilon = E \begin{pmatrix}
u_{1t}u_{1t} & \cdots & u_{1t}u_{Nt} & \cdots & u_{1t}u_{1T} & \cdots & u_{1t}u_{NT} \\
\vdots & \ddots & \vdots & & \vdots & \ddots & \vdots \\
u_{Nt}u_{1t} & \cdots & u_{Nt}u_{Nt} & \cdots & u_{Nt}u_{1T} & \cdots & u_{Nt}u_{NT} \\
\vdots & & \vdots & \ddots & \vdots & & \vdots \\
u_{1T}u_{1t} & \cdots & u_{1T}u_{Nt} & \cdots & u_{1T}u_{1T} & \cdots & u_{1T}u_{NT} \\
\vdots & \ddots & \vdots & & \vdots & \ddots & \vdots \\
u_{NT}u_{1t} & \cdots & u_{NT}u_{Nt} & \cdots & u_{NT}u_{1T} & \cdots & u_{NT}u_{NT}
\end{pmatrix} \]

A practical difficulty with this unconditional QML is that the estimation of the parameters in $\theta = [\lambda, \rho_1, \rho_2, \sigma^2]$ is computationally complex, because evaluating this likelihood involves a repeated evaluation of the determinant of the $NT \times NT$ matrix $\Upsilon$. For this reason, I moved to another methodology, which exploits valid moment conditions but is computationally easy.

5.2.1 Non Linear Least Squares

Another way to estimate the parameters in $\theta$ without imposing any distributional assumption on the vector $\epsilon(t)$ is to use all the moment conditions inside $\Upsilon = E(\tilde{U} \tilde{U}')$.
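The transformation by $Q_1$ and $Q_2$ used above can be checked on a small example. The sketch below assumes the state-major stacking implied by the Kronecker structure of $Q_1$ and $Q_2$ (all periods of state 1, then all periods of state 2, and so on); the panel dimensions and the simulated effects are hypothetical. It confirms that applying $Q_2 Q_1$ wipes out additive state and year effects and leaves only the transformed noise.

```python
import numpy as np

def demeaning_matrices(n, t):
    """Q1 removes each state's mean over time; Q2 removes each period's mean over states."""
    iota_t = np.ones((t, 1))
    iota_n = np.ones((n, 1))
    Q1 = np.kron(np.eye(n), np.eye(t) - iota_t @ iota_t.T / t)
    Q2 = np.kron(np.eye(n) - iota_n @ iota_n.T / n, np.eye(t))
    return Q1, Q2

n, t = 3, 4                                   # hypothetical tiny panel
rng = np.random.default_rng(1)
alpha = rng.normal(size=(n, 1))               # state (fixed) effects
delta = rng.normal(size=(1, t))               # year (time) effects
noise = rng.normal(size=(n, t))
y = alpha + delta + noise                     # n x t array, rows are states

Q1, Q2 = demeaning_matrices(n, t)
y_tilde = Q2 @ Q1 @ y.reshape(-1)             # reshape(-1) gives state-major stacking
noise_tilde = Q2 @ Q1 @ noise.reshape(-1)

print("effects removed:", np.allclose(y_tilde, noise_tilde))
print("Q1 and Q2 commute:", np.allclose(Q1 @ Q2, Q2 @ Q1))
```

Because $Q_1$ and $Q_2$ are symmetric, idempotent and commute, the order in which they are applied does not matter.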

For simplicity I define the vector $U(t)$ as the vector of shocks for all the states at time $t$. Therefore, for each time $t$ the parametric model for the error term of AMS is the following:

\[ U(t) = \lambda W_N U(t) + \rho_1 U(t-1) + \rho_2 W_N U(t-1) + \epsilon(t) \]

or, in reduced form:

\[ U(t) = A_1 U(t-1) + B_N^{-1} \epsilon(t) \quad (5.4) \]

\[ \Upsilon = E \begin{pmatrix}
U(t)U(t)' & U(t)U(t+1)' & \cdots & U(t)U(T)' \\
U(t+1)U(t)' & U(t+1)U(t+1)' & \cdots & U(t+1)U(T)' \\
\vdots & \vdots & \ddots & \vdots \\
U(T)U(t)' & U(T)U(t+1)' & \cdots & U(T)U(T)'
\end{pmatrix} \]

This $\Upsilon$ can be expressed as a function of the parameters in $\theta$. $A_1$ is not necessarily symmetric and, in particular, need not have all its eigenvalues less than one, even when $u_{it}$ is time-stationary. Therefore each sub-matrix of $\Upsilon$ could depend on the difference in time. We can express (5.4) in its MA form using the lag operator:

\[ U(t) = B_N^{-1} \epsilon(t) + A_1 B_N^{-1} \epsilon(t-1) + A_1^2 B_N^{-1} \epsilon(t-2) + A_1^3 B_N^{-1} \epsilon(t-3) + \dots \]

In this way, if we have the index $t = 1, 2, 3, \dots, T$:

\[ E[U(1)U(1)'] = (B_N' B_N)^{-1} \sigma^2 I_N, \quad E[U(1)U(2)'] = (B_N' B_N)^{-1} \sigma^2 A_1', \quad E[U(1)U(3)'] = (B_N' B_N)^{-1} \sigma^2 A_1^{2\prime}, \quad \dots, \quad E[U(1)U(T)'] = (B_N' B_N)^{-1} \sigma^2 A_1^{T-1\prime} \]

\[ E[U(2)U(1)'] = (B_N' B_N)^{-1} \sigma^2 A_1, \quad E[U(2)U(2)'] = (B_N' B_N)^{-1} \sigma^2 [I_N + A_1 A_1'], \quad E[U(2)U(3)'] = (B_N' B_N)^{-1} \sigma^2 [A_1' + A_1 A_1^{2\prime}], \quad \dots, \quad E[U(2)U(T)'] = (B_N' B_N)^{-1} \sigma^2 [A_1^{T-2\prime} + A_1 A_1^{T-1\prime}] \]

\[ E[U(3)U(1)'] = (B_N' B_N)^{-1} \sigma^2 [A_1^2], \quad E[U(3)U(2)'] = (B_N' B_N)^{-1} \sigma^2 [A_1 + A_1^2 A_1'], \quad E[U(3)U(3)'] = (B_N' B_N)^{-1} \sigma^2 [I_N + A_1 A_1' + A_1^2 A_1^{2\prime}], \quad \dots, \quad E[U(3)U(T)'] = (B_N' B_N)^{-1} \sigma^2 [A_1^{T-3\prime} + A_1 A_1^{T-2\prime} + A_1^2 A_1^{T-1\prime}] \]

\[ E[U(T)U(1)'] = (B_N' B_N)^{-1} \sigma^2 A_1^{T-1}, \quad E[U(T)U(2)'] = (B_N' B_N)^{-1} \sigma^2 [A_1^{T-2} + A_1^{T-1} A_1'], \quad E[U(T)U(3)'] = (B_N' B_N)^{-1} \sigma^2 [A_1^{T-3} + A_1^{T-2} A_1' + A_1^{T-1} A_1^{2\prime}], \quad \dots, \quad E[U(T)U(T)'] = (B_N' B_N)^{-1} \sigma^2 [I_N + A_1 A_1' + \dots + A_1^{T-1} A_1^{T-1\prime}] \]

Using the residuals that come from the transformation of the AMS model, we can minimize the quadratic distance between all the moment conditions inside $Q_2 Q_1 \Upsilon(\theta) Q_1 Q_2$ and the sample counterpart of $E[\hat{U} \hat{U}']$. Defining $\zeta_p = vech(\hat{U} \hat{U}')$ and $\xi_p(\theta) = vech(Q_2 Q_1 \Upsilon(\theta) Q_1 Q_2)$, we can estimate $\theta$ by nonlinear least squares from the following model:

\[ \zeta_p = \xi_p(\theta) + \upsilon_p \]

where $p = NT(NT+1)/2$, so that

\[ \hat{\theta}_{NLS} = \arg\min_{\theta} \left\{ [\zeta_p - \xi_p(\theta)]' [\zeta_p - \xi_p(\theta)] \right\} \]

This estimator is consistent but inefficient, because we are not using the correct weighting matrix for the moments in $\Upsilon(\theta)$. (For simplicity I use the identity matrix as the weighting matrix instead of $Var(\zeta_p)$. To use the correct matrix I would need to know the fourth moments of $u_{it}$.) To compute standard errors for $\hat{\theta}_{NLS}$, I used a Monte Carlo simulation, generating replications of the following model:

\[ U = \hat{\lambda}_{NLS} [I_T \otimes W_N] U + \hat{\rho}_{1,NLS} U_{t-1} + \hat{\rho}_{2,NLS} [I_T \otimes W_N] U_{t-1} + \epsilon \]

For each simulation I generate an $NT$ vector $\epsilon$ from a normal distribution $N(0, \hat{\sigma}_{NLS} Q_2 Q_1 Q_1 Q_2)$. The results of the OLS and NLS estimations of the spatial-time model for the transformed OLS residuals and the transformed 2SLS residuals are summarized in Table 4 and Table 5.
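The moment-matching objective can be sketched as follows. This is a simplified illustration and not the code used for the estimates reported here. It rests on several assumptions that are mine: the stacking is time-major, $(U(1)', \dots, U(T)')'$, so the demeaning matrix is the time-major counterpart of $Q_2 Q_1$; the process starts at $U(1) = B_N^{-1}\epsilon(1)$; the blocks of $\Upsilon(\theta)$ are built by the recursion $Var(U(t)) = A_1 Var(U(t-1)) A_1' + \sigma^2 B_N^{-1} B_N^{-1\prime}$ rather than from the closed-form expressions; the contiguity matrix and parameter values are hypothetical; and population moments stand in for $vech(\hat{U}\hat{U}')$.

```python
import numpy as np

def vech(m):
    """Stack the lower-triangular elements (diagonal included) of a square matrix."""
    return m[np.tril_indices(m.shape[0])]

def upsilon(theta, W, T):
    """Covariance of (U(1)', ..., U(T)')' implied by theta = (lam, rho1, rho2, sigma2)."""
    lam, rho1, rho2, sigma2 = theta
    n = W.shape[0]
    B_inv = np.linalg.inv(np.eye(n) - lam * W)
    A1 = B_inv @ (rho1 * np.eye(n) + rho2 * W)
    S0 = sigma2 * B_inv @ B_inv.T                # Var of B_N^{-1} eps(t)
    var_s = S0                                   # assumption: U(1) = B_N^{-1} eps(1)
    ups = np.zeros((n * T, n * T))
    for s in range(T):
        block = var_s                            # Cov(U(s), U(s))
        for t in range(s, T):
            ups[t*n:(t+1)*n, s*n:(s+1)*n] = block
            ups[s*n:(s+1)*n, t*n:(t+1)*n] = block.T
            block = A1 @ block                   # Cov(U(t+1), U(s))
        var_s = A1 @ var_s @ A1.T + S0           # Var(U(s+1))
    return ups

def demean_matrix(n, T):
    """Time-major counterpart of Q2 Q1: demean over time and over states."""
    return np.kron(np.eye(T) - 1.0 / T, np.eye(n) - 1.0 / n)

def nls_objective(theta, zeta, W, T):
    """Sum of squared distances between sample and model-implied moments."""
    Q = demean_matrix(W.shape[0], T)
    diff = zeta - vech(Q @ upsilon(theta, W, T) @ Q)
    return float(diff @ diff)

# Hypothetical example: 4 states in a row, 5 periods.
adj = np.array([[0, 1, 0, 0], [1, 0, 1, 0], [0, 1, 0, 1], [0, 0, 1, 0]], dtype=float)
W = adj / adj.sum(axis=1, keepdims=True)
T = 5
theta0 = (0.4, 0.3, 0.1, 1.0)
Q = demean_matrix(4, T)
zeta = vech(Q @ upsilon(theta0, W, T) @ Q)       # stand-in for vech(U_hat U_hat')

obj_true = nls_objective(theta0, zeta, W, T)
obj_other = nls_objective((0.2, 0.3, 0.1, 1.0), zeta, W, T)
print(f"objective at theta0: {obj_true:.3e}, at a wrong lambda: {obj_other:.3e}")
```

In practice `nls_objective` would be passed to a numerical optimizer, with `zeta` built from the transformed residuals.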


Table N4

Table N5

The results do not change drastically: all the estimates of the three parameters are still highly significant, but now the estimate of the spatial parameter is higher (around 0.4) and the estimate of the lagged spatial parameter (around 0.1) is lower than with the conditional QML estimator applied to the residuals without deviations.

5.3 Simulating the AMS model with error dependence

Using these last estimated values of the parameters that characterize the dependence, I simulate one thousand replications of shocks with time and spatial dependence:

\[ U = \hat{\lambda}_{NLS} [I_T \otimes W_N] U + \hat{\rho}_{1,NLS} U_{t-1} + \hat{\rho}_{2,NLS} [I_T \otimes W_N] U_{t-1} + \epsilon \]

For each simulation the $NT$ vector $\epsilon$ was drawn from a normal distribution $N(0, \hat{\sigma}_{NLS} Q_2 Q_1 Q_1 Q_2)$. Then, using the real regressors from the AMS model (the effective minimum wage and its square), I simulate the dependent variable of each replication:

\[ y_J = \beta_1 Q_2 Q_1 [\log wage^{Min} - \log wage(50)] + \beta_2 Q_2 Q_1 [\log wage^{Min} - \log wage(50)]^2 + U_J, \quad J = 1, \dots, 1000 \]

The values of $\beta_1$ and $\beta_2$ come from the estimation of the AMS model. Finally, for each replication I compute the OLS estimators from a regression of $y_J$ on $Q_2 Q_1 [\log wage^{Min} - \log wage(50)]$ and $Q_2 Q_1 [\log wage^{Min} - \log wage(50)]^2$, together with the different variance estimators of the OLS estimator under each of the approaches (Non Parametric and Parametric). Then I evaluate, using individual tests, how many times we reject the true null hypotheses $\hat{\beta}_1^{OLS} = \beta_1$ and $\hat{\beta}_2^{OLS} = \beta_2$ at the 5% significance level for each of the estimated variances (Non Parametric and Parametric).

6 Results

6.1 Simulation Exercise

The results of the simulation are summarized in Table N8. We can observe that the type one error is around 70% for both estimators if we do not control for dependence in a setting in which it exists. Moreover, Table N8 shows that the type one error using the cluster variance is far from the nominal size, even when multiway clusters, which control for all forms of dependence, are used. This is because these estimators do not behave correctly in small samples of $N$ and $T$. For example, the type one error of the state clusters, which control for time dependence, is around 22.5% and 24.4% for each beta, respectively. These higher probabilities of rejecting the true null hypothesis can be explained by the following two reasons: (i) this variance estimator does not take into account the current and the lagged spatial dependence, and (ii) the number of state clusters is not large enough.
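The type one error computation described above amounts to counting how often a t-test rejects a true null across replications. The sketch below illustrates the idea in a deliberately stylized setting that is not the AMS design: a single time series of length 30 in which both the regressor and the error follow a hypothetical AR(1) process, with conventional OLS standard errors that ignore the dependence. All function names and parameter values are illustrative.

```python
import numpy as np

def ar1(rng, t, rho):
    """Stationary AR(1) series with unit innovation variance."""
    x = np.empty(t)
    x[0] = rng.standard_normal() / np.sqrt(1.0 - rho**2)
    for s in range(1, t):
        x[s] = rho * x[s - 1] + rng.standard_normal()
    return x

def rejection_rate(beta_hats, std_errs, beta_true, crit=1.96):
    """Share of replications in which |t| exceeds the critical value."""
    t_stats = (np.asarray(beta_hats) - beta_true) / np.asarray(std_errs)
    return float(np.mean(np.abs(t_stats) > crit))

def simulate_size(rho, t=30, reps=2000, beta=0.0, seed=0):
    """Empirical size of the conventional t-test when errors are AR(1)."""
    rng = np.random.default_rng(seed)
    b, se = np.empty(reps), np.empty(reps)
    for r in range(reps):
        x, u = ar1(rng, t, rho), ar1(rng, t, rho)
        y = beta * x + u
        b[r] = (x @ y) / (x @ x)                     # OLS without intercept
        resid = y - b[r] * x
        se[r] = np.sqrt(resid @ resid / (t - 1) / (x @ x))  # ignores dependence
    return rejection_rate(b, se, beta)

size_dependent = simulate_size(0.8, reps=500)
size_independent = simulate_size(0.0, reps=500)
print(f"rejection rate with rho = 0.8: {size_dependent:.3f}")
print(f"rejection rate with rho = 0.0: {size_independent:.3f}")
```

The same `rejection_rate` function applies unchanged when the standard errors come from a cluster or parametric variance estimator.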

Table N8: Probability of rejecting the null hypothesis when it is true

In the case of the time clusters, which control for spatial dependence, the type one errors for the betas are significantly higher (more than double) than with the state clusters. This could occur because this variance estimator takes into account neither the dependence across time nor the dependence across states at different moments in time. We must also note that the number of time clusters is considerably smaller than the number of state clusters (31 vs 50), so this variance estimator behaves poorly. However, it is remarkable that the two-way cluster and the multiway clustering approach, which controls for all kinds of dependence, still have a high probability of rejecting the true null hypothesis (around 20%), even though they behave better than the one-way cluster variances. These results reinforce the importance of having large samples in both $N$ and $T$ when we want to use the clustering approach to control for general forms of dependence. Therefore, the Non Parametric approach does not behave well in small samples, as already discussed in the methodology. Finally, the parametric variance estimator is the only one which almost attains the nominal size of the test for both betas.

The main conclusions that emerge from the simulation exercise are the following: (i) Ignoring error dependence in a setting where it exists leads to higher rates of rejection of the true hypothesis (accepting spurious regressors when there is no causal relationship). (ii) Using the Non Parametric clustering approach to deal with error dependence in a setting where the number of clusters is not

so large leads to a bad approximation of the variance covariance matrix, as reflected in a type one error significantly above the nominal size of the test (because these estimators need $N$ and $T$ to go to infinity). (iii) In small samples, using the parametric approach is a better way to control for the dependence, provided the researcher is confident about the parametric model.

6.2 Robust Standard Errors for the AMS model

In this subsection, I re-estimate the model of AMS in order to compute standard errors using the methodologies previously discussed in this article (one-way cluster, two-way cluster, multiway cluster and parametric approach). We can then analyze how the results change once robust standard errors are used in a context where there is error dependence in more than one dimension. In the following subsections I discuss separately the variance estimates for the marginal effects arising from: (i) the OLS estimate of a regression of US state inequality on the state effective minimum and its square, plus fixed effects and time effects as controls; and (ii) the 2SLS estimation after instrumenting both the state effective minimum wage and the square of the state effective minimum wage. The instruments are, respectively, the statutory minimum and the square of the predicted value from a regression of the state effective minimum wage on the statutory minimum plus fixed and time effects.
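For reference, a two-way cluster variance can be sketched as follows. The code assumes the standard form in which the state-cluster and year-cluster "meat" matrices are added and the meat of their intersection (here each state-year cell) is subtracted, without any small-sample correction; whether this matches every detail of the estimators compared in this article is an assumption. The simulated data, coefficients and function names are hypothetical, and the regressors are taken as already demeaned, so no intercept or fixed effects are included.

```python
import numpy as np

def cluster_meat(X, resid, groups):
    """Sum over clusters g of (X_g' u_g)(X_g' u_g)'."""
    k = X.shape[1]
    meat = np.zeros((k, k))
    for g in np.unique(groups):
        score = X[groups == g].T @ resid[groups == g]
        meat += np.outer(score, score)
    return meat

def twoway_cluster_cov(X, y, state, year):
    """OLS coefficients and two-way (state and year) cluster-robust covariance."""
    XtX_inv = np.linalg.inv(X.T @ X)
    beta = XtX_inv @ (X.T @ y)
    resid = y - X @ beta
    cell = state * (year.max() + 1) + year            # state-year intersection id
    meat = (cluster_meat(X, resid, state)
            + cluster_meat(X, resid, year)
            - cluster_meat(X, resid, cell))
    return beta, XtX_inv @ meat @ XtX_inv

# Hypothetical balanced panel: 50 states, 30 years, regressor and its square.
rng = np.random.default_rng(0)
n_states, n_years = 50, 30
state = np.repeat(np.arange(n_states), n_years)
year = np.tile(np.arange(n_years), n_states)
x = rng.normal(size=n_states * n_years)
X = np.column_stack([x, x**2])
y = X @ np.array([0.5, -0.2]) + rng.normal(size=x.size)

beta_hat, V = twoway_cluster_cov(X, y, state, year)
print("beta_hat:", np.round(beta_hat, 3))
print("two-way cluster covariance:\n", V)
```

This covariance is not guaranteed to be positive semidefinite in finite samples, which is one more reason for caution when the number of clusters in either dimension is small.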

Table N9: Different variance estimators of the OLS estimator of the parameters of the AMS model

As can be seen in Table N9, the standard errors for the OLS estimates of the minimum wage and its square increase by approximately 200% when we switch from no clustering to state clusters or time clusters. When we cluster in only one dimension, either by state or by time, the OLS estimator of the minimum wage variable remains significant at the 1% level. However, the OLS estimator of the square of the minimum wage is individually significant only at the 5% level, a reduction with respect to the non-cluster case. When we use variance estimators that take into account the dependence in more than one dimension, the OLS estimator of the square of the effective minimum wage loses significance at all levels, while the OLS estimator of the minimum wage remains significant, but this time at the 5% level for the multiway cluster and at the 10% level using the parametric variance. Given that we are interested in the total marginal effect of the effective minimum rather than in each of its components separately ($\hat{\beta}_1$, $\hat{\beta}_2$), and because the standard errors could be higher due to multicollinearity between the effective minimum wage and its square, I present in Table N10 the estimated average marginal effect of the minimum wage on US state inequality, which is given by


\[ \hat{\beta}_1 + 2 \hat{\beta}_2 \frac{1}{NT} \sum_{i=1}^{N} \sum_{t=1}^{T} \left[ \log wage^{Min}_{it} - \log wage_{it}(50) \right] \]

(Autor et al (2010) only present the results for the estimated marginal effect.)

Table N10: Different variance estimators of the OLS estimator of the marginal effect of the AMS model

The standard error of the estimated marginal effect increases by 200% when we use one-way clustering (either by time or by space) and by 300% when we use a multiway clustering approach, relative to the case without clustering. Despite this, the estimated marginal effect remains significant at the 5% level. Finally, when we use the parametric variance to control for the dependence, the marginal effect of the minimum wage on US state inequality loses significance.

As AMS stress, the OLS estimator of the model could be affected by a division bias problem, because the variable $\log wage_{it}(50)$ appears on both sides of the regression. Therefore, the 2SLS estimation is more appropriate for studying the effect of the minimum wage on US state inequality. Table N11 presents the results for the 2SLS estimation.
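Given any estimated covariance matrix of $(\hat{\beta}_1, \hat{\beta}_2)$, the standard error of the average marginal effect follows from a linear combination. The sketch below treats the sample mean of the effective minimum as fixed, which is an assumption, and uses made-up numbers purely for illustration. It also shows why the marginal effect can be estimated more precisely than either coefficient: a negative covariance between the two estimates, as is typical with a regressor and its square, partly offsets their individual variances.

```python
import numpy as np

def average_marginal_effect(beta, cov_beta, x):
    """AME = b1 + 2 b2 mean(x) and its standard error, treating mean(x) as fixed."""
    x_bar = float(np.mean(x))
    grad = np.array([1.0, 2.0 * x_bar])           # derivative of AME w.r.t. (b1, b2)
    ame = float(grad @ beta)
    se = float(np.sqrt(grad @ cov_beta @ grad))
    return ame, se

# Hypothetical estimates (not taken from any table in this article).
beta = np.array([1.0, -0.5])
cov_beta = np.array([[0.04, -0.01],
                     [-0.01, 0.01]])
x = np.array([1.0, 2.0, 3.0])                     # stand-in for the effective minimum

ame, se = average_marginal_effect(beta, cov_beta, x)
print(f"average marginal effect = {ame:.3f}, standard error = {se:.4f}")
```

Replacing `cov_beta` with the one-way, multiway or parametric covariance estimate gives the corresponding standard error of the marginal effect.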
