Robust Standard Errors to spatial and time dependence in state-year panels. Lucciano Villacorta Gonzales. June 24, 2013.


Abstract

There are several settings where panel data models present time and spatial dependence in the error term. Ignoring this dependence can misestimate the standard errors and lead to inaccurate statistical inference. In this article I study the small-sample behavior of different variance estimators that are robust to spatial and time dependence in the error of the model. The two-way cluster estimator of Cameron et al. (2011), perhaps the most widely used variance estimator for dealing with more than one type of dependence, converges to its asymptotic normal distribution at the rate $\sqrt{\min(N, T)}$ rather than at the rate $\sqrt{NT}$. As a consequence, tests based on the two-way cluster estimator perform poorly when $N$ and $T$ are not large enough, as in state-year or country-year panels. In those settings the gain from imposing some structure on the error dependence could be important. I propose a more parsimonious structure to estimate the variance of the OLS estimator that is robust to spatial and time dependence in the error term and has better small-sample properties than the two-way cluster estimator. In addition, I propose a nonlinear least squares estimation that is consistent and computationally efficient in the presence of fixed and time effects in middle panels. I study the behavior of the different approaches with a Monte Carlo exercise based on a state-year panel of wage inequality and the minimum wage in the US. Finally, I re-estimate the model of Autor et al. (2010), which studies the relationship between state wage inequality and the state minimum wage in the US with a panel of size $N = 50$ and $T = 30$. When both types of error dependence are considered in the variance estimator of the OLS estimator of the marginal effect of the minimum wage on wage inequality, the significance of this effect is lost.

*I would like to thank Manuel Arellano and Stephane Bonhomme for their guidance and especially for sharing part of their time in discussions, which were very useful for gaining a better understanding of econometrics. As always, errors and omissions are exclusively my own responsibility.

1 Introduction

Panel data models have been increasingly used in applied econometrics. They offer the possibility of controlling for some kinds of endogeneity without the need for exogenous instruments and allow us to consider richer models, including dynamics, time effects and factor structures. Nevertheless, once we move to the panel world we should be aware of the possibility of correlations across time and across individuals in the error term beyond those controlled for by individual and time dummies. In setups where the error presents time and spatial correlation, the variance of the OLS estimator of the parameters will be a function of this correlation (as will the variances of other estimators such as IV and GMM). Hence, it is crucial to estimate this variance correctly for accurate statistical inference. Ignoring these types of error dependence would lead to an underestimation of standard errors, dramatically increasing the probability of type I error. In particular, Cameron et al. (2011) showed two empirical examples where standard errors increased more than threefold when both types of error dependence were taken into account.

In this article I study different approaches to consistently estimating the variance of the OLS estimator that are robust to spatial and time dependence in panels in which the number of cross-sectional units $N$ and the number of time periods $T$ are of similar size and not large, as in state-year or country-year panels, where the number of states or countries is limited and the available time series are short. Computing a variance estimator that is robust to both types of dependence is not common in applied econometrics. In particular, Bertrand et al. (2004), Hoeschle (2007) and Drummond et al. (2009) provide evidence of a wide range of empirical applications where the error term could exhibit one or more forms of dependence that are usually not taken into account or not corrected in the right way.

How should we estimate the variance of the OLS estimator when there is spatial and time dependence? One possibility put forward in the literature is the estimator proposed by Cameron et al. (2011). This estimator is totally nonparametric and follows the spirit of White (1980) for independent heteroskedastic errors. In particular, they extend the one-way cluster approach of Arellano (1987) to multiway clustering that takes into account error correlation in more than one dimension. In the panel data setup with time and spatial dependence we can apply a two-way clustering procedure, decomposing the total variance into two separate parts: (i) one that contains only the spatial correlation, which is estimated by grouping the data by time clusters, and (ii) one that contains only the time correlation, which is estimated by grouping the data by spatial clusters. The main advantage of this approach is that it does not require any specification of the variance-covariance matrix, allowing for general forms of dependence. Nevertheless, to obtain an accurate approximation, this variance estimator needs a sufficiently large number of both types of clusters, since its asymptotic properties hold when the number of clusters goes to infinity. Moreover, the rates at which each cluster estimator (spatial cluster and time cluster) converges to its asymptotic normal distribution are $\sqrt{N}$ and $\sqrt{T}$ respectively, rather than $\sqrt{NT}$, implying that the two-way cluster estimator converges at a slower rate to its asymptotic distribution. As a consequence, tests based on the two-way clustering estimator perform poorly when $N$ and $T$ are not large enough.

In settings where $N$ and $T$ are not large enough, the gain from imposing some structure on the error dependence could be important. For example, if we limit the dependence with some mixing assumption we can consider alternative estimators with better small-sample properties than the two-way cluster estimator, such as a double Heteroskedasticity and Autocorrelation Consistent (HAC) variance estimator. With a mixing assumption in the time series, we can use the HAC estimator of Newey and West to account for the time dependence, and with a mixing assumption in the cross section we can estimate a spatial HAC version as in Conley (1999) or Kelejian and Prucha (2003). A third alternative, which could be more convenient in small samples, is to impose more structure by modeling the dependence parametrically.

The purpose of this paper is twofold. On the one hand, I study the behavior of the nonparametric clustering approach in middle panels with spatial and time dependence, where the numbers of cross-sectional and time series observations are of similar size and not very large, as in state-year or country-year panels. I study the properties of the two-way clustering estimator both analytically and numerically. As in Hansen (2007) for time series dependence, I analyze, in Section 3.1, the properties of two-way clustering when we have time and spatial dependence and $N$ and $T$ are of the same magnitude. In addition, in Section 4, I study the behavior of the two-way clustering approach with a Monte Carlo exercise based on a state-year panel of wage inequality and the minimum wage.

The second purpose of the paper is to propose a set of more parsimonious structures to estimate the variance of the OLS estimator that are robust to spatial and time dependence in the error term and have better small-sample properties than the two-way cluster estimator. For that purpose, in Section 3.2, I build on the Spatial Autoregressive (SAR) model proposed by Cliff and Ord (1973) and studied extensively by Anselin (1988, 1992), Baltagi et al. (2003, 2007), Elhorst (2003), Kelejian and Prucha (1998, 2001, 2007), Lee (2004, 2007) and Lee and Yu (2006, 2007, 2008), but including terms that also take into account the time dependence. In fact, the two-way clustering approach of Cameron et al. (2011) can be expressed in terms of a fully flexible spatial and time autoregressive model. Depending on the context, we can use less flexible models nested in the flexible one in order to account for both types of dependence. In Section 3.2, I show how to estimate the underlying parameters of the spatial and time autoregressive model of order one. In addition, I propose a nonlinear least squares estimation of this model that is consistent and computationally efficient in the presence of fixed and time effects in middle panels.

In Section 4, I work with the state-year panel model of Autor et al. (2010), which studies the relationship between state wage inequality and the state minimum wage in a panel with fixed and time effects and with sample size $N = 50$ and $T = 30$. I use this specific application for two reasons: (i) to study the small-sample behavior of the two-way clustering approach and the spatial and time autoregressive model in an empirical application, and (ii) to analyze how the results and conclusions change once we control for several forms of dependence. Using Monte Carlo simulations for a model with characteristics similar to those of the Autor et al. model in terms of regressors, sample size and structure of the error dependence, I find a probability of type I error of around 70%, instead of the nominal size of 5%, when error dependence is ignored. The two-way cluster approach has a type I error rate of 16% when the nominal size of the test is 5%. On the other hand, using standard errors that incorporate the dependence parametrically attains the nominal size of the test.

Finally, in Section 5, I re-estimate the model of Autor et al. (2010), computing standard errors with the methodologies discussed in this article (one-way cluster, two-way cluster and parametric approach). The standard error of the OLS estimator of the marginal effect of the effective minimum wage on inequality increases by 200% when I use one-way clustering (either by time or by state) and by 300% when I use two-way clustering, relative to no clustering. Despite this, the estimated marginal effect remains significant at the 5% level. However, when I use the parametric variance to control for the dependence, the marginal effect of the minimum wage on US state inequality loses its significance. Finally, the standard error of the average marginal effect estimated by 2SLS increases by 100% and by 200% when I use the two-way cluster approach and the parametric approach, respectively. Using either the two-way clustering or the parametric variance to account for the dependence, the 2SLS estimate of the marginal effect of the minimum wage on US state inequality is no longer significant.

2 A panel model with spatial and time dependence

Consider a linear panel regression model defined by:

$$y_{it} = x_{it}'\beta + \alpha_i + \delta_t + u_{it} \qquad (2.1)$$

where $i = 1, \dots, N$ indexes individuals, $t = 1, \dots, T$ indexes time, $x_{it}$ is a vector of observable covariates, $\alpha_i$ is an individual fixed effect that is constant over time, $\delta_t$ is a time effect that is common across individuals and $u_{it}$ is an unobservable error component. For simplicity, as in Hansen (2007), I work with a transformation of the variables in order to remove the nuisance parameters $\alpha_i$ and $\delta_t$:

$$y_{it} = x_{it}'\beta + u_{it} \qquad (2.2)$$

where the variables in (2.2) are the variables in (2.1) in deviations from their own individual sample mean and time sample mean.

Assumption 1: $E(u_{it} \mid x_{it}) = 0$.

The OLS estimator of $\beta$ from equation (2.2) is defined as

$$\hat\beta = \left(\sum_{i}^{N}\sum_{t}^{T} x_{it}x_{it}'\right)^{-1}\left(\sum_{i}^{N}\sum_{t}^{T} x_{it}y_{it}\right).$$

Under Assumption 1 and regularity conditions, $d_{NT}(\hat\beta - \beta) \xrightarrow{d} N(0, V)$, where $V$ is the asymptotic variance of the OLS estimator and is equal to $Q^{-1} W Q^{-1}$, with

$$Q = \lim_{d_{NT}\to\infty} \frac{1}{NT}\sum_{i}^{N}\sum_{t}^{T} x_{it}x_{it}' \quad\text{and}\quad W = \lim_{d_{NT}\to\infty}\left[\frac{1}{NT}\, VAR\left(\sum_{i}^{N}\sum_{t}^{T} x_{it}u_{it}\right)\right].$$

So far I have not specified the correlation structure of $u_{it}$. In fact, the speed of convergence of the OLS estimator, $d_{NT}$, and the form of $W$ and $V$ will depend on the correlation structure in $u_{it}$ and $x_{it}$. In the most general case of dependence:

$$W = \frac{1}{NT}\, VAR\left(\sum_{i}^{N}\sum_{t}^{T} x_{it}u_{it}\right) = \frac{1}{NT}\left[\sum_{i=1}^{N} VAR\left(\sum_{t}^{T} x_{it}u_{it}\right) + 2\sum_{i=1}^{N-1}\sum_{j=i+1}^{N} COV\left(\sum_{t}^{T} x_{it}u_{it},\ \sum_{t}^{T} x_{jt}u_{jt}\right)\right]$$
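The transformation in (2.2) and the OLS formula above can be sketched in a few lines of NumPy. This is only an illustrative sketch for a balanced panel stored as arrays of shape (N, T) and (N, T, k); the function names `two_way_demean` and `within_ols`, the simulated data and the parameter values are assumptions made for the example, not part of the paper.

```python
import numpy as np


def two_way_demean(a):
    """Remove unit and time means from a balanced panel array.

    a has shape (N, T) or (N, T, k). For a balanced panel, subtracting
    the unit mean and the time mean and adding back the grand mean
    removes additive individual and time effects.
    """
    unit_mean = a.mean(axis=1, keepdims=True)        # mean over t for each i
    time_mean = a.mean(axis=0, keepdims=True)        # mean over i for each t
    grand_mean = a.mean(axis=(0, 1), keepdims=True)  # overall mean
    return a - unit_mean - time_mean + grand_mean


def within_ols(y, x):
    """OLS of the transformed y on the transformed x, as in (2.2).

    y: (N, T) outcomes, x: (N, T, k) covariates.
    Returns the coefficient vector and the (N, T) residual matrix.
    """
    n_units, n_periods, k = x.shape
    y_dm = two_way_demean(y).reshape(n_units * n_periods)
    x_dm = two_way_demean(x).reshape(n_units * n_periods, k)
    beta_hat = np.linalg.solve(x_dm.T @ x_dm, x_dm.T @ y_dm)
    resid = (y_dm - x_dm @ beta_hat).reshape(n_units, n_periods)
    return beta_hat, resid


# Hypothetical data with the panel dimensions used in the paper (N = 50, T = 30).
rng = np.random.default_rng(0)
N, T = 50, 30
beta_true = np.array([1.0, -0.5])           # illustrative values only
x = rng.normal(size=(N, T, 2))
alpha = rng.normal(size=(N, 1))             # individual effects
delta = rng.normal(size=(1, T))             # time effects
u = rng.normal(size=(N, T))                 # i.i.d. errors, for simplicity
y = x @ beta_true + alpha + delta + u

beta_hat, resid = within_ols(y, x)
print(beta_hat)
```

The residual matrix returned here is the input that any of the variance estimators discussed later would use; the errors in this toy simulation are independent, so it does not by itself illustrate spatial or time dependence.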

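Assumption 2: $E(u_{it}u_{jt}) \neq 0$, $E(u_{it}u_{is}) \neq 0$, $E(u_{it}u_{js}) = 0$ for $i \neq j$ and $t \neq s$.

Assumption 2 implies spatial dependence across individuals in the same period and time dependence for each individual, but rules out correlation between different individuals at different moments in time. Under Assumption 2 the expression for $W$ is the following:

$$W = \frac{1}{NT}\left[\sum_{i=1}^{N} VAR\left(\sum_{t}^{T} x_{it}u_{it}\right) + 2\sum_{i=1}^{N-1}\sum_{j=i+1}^{N} COV\left(\sum_{t}^{T} x_{it}u_{it},\ \sum_{t}^{T} x_{jt}u_{jt}\right)\right]$$

By linearity of the covariance, and because only same-period cross terms are nonzero:

$$W = \frac{1}{NT}\left[\sum_{i=1}^{N} VAR\left(\sum_{t}^{T} x_{it}u_{it}\right) + 2\sum_{t}^{T}\sum_{i=1}^{N-1}\sum_{j=i+1}^{N} COV\left(x_{it}u_{it},\ x_{jt}u_{jt}\right)\right]$$

If we add and subtract $\sum_{i}^{N}\sum_{t}^{T} VAR(x_{it}u_{it})$ we can express $W$ as:

$$W = \frac{1}{NT}\left[\sum_{i=1}^{N} VAR\left(\sum_{t}^{T} x_{it}u_{it}\right) + \sum_{t}^{T} VAR\left(\sum_{i}^{N} x_{it}u_{it}\right) - \sum_{i}^{N}\sum_{t}^{T} VAR(x_{it}u_{it})\right]$$

Let us define two different ways of grouping the data: $U_i = (u_{i1}, u_{i2}, \dots, u_{iT})'$ and $U_t = (u_{1t}, u_{2t}, \dots, u_{Nt})'$. Then

$$W = \frac{1}{NT}\sum_{i=1}^{N} E(X_i'U_iU_i'X_i) + \frac{1}{NT}\sum_{t}^{T} E(X_t'U_tU_t'X_t) - \frac{1}{NT}\sum_{i}^{N}\sum_{t}^{T} VAR(x_{it}u_{it})$$

I have separated $W$ into three parts: a first one that takes into account only the time dependence, a second one that takes into account only the spatial dependence, and a third one that takes into account only the variance of each observation:

$$W = \frac{1}{N}\sum_{i=1}^{N} E(X_i'\Omega_iX_i) + \frac{1}{T}\sum_{t}^{T} E(X_t'\Omega_tX_t) - \frac{1}{NT}\sum_{i}^{N}\sum_{t}^{T} VAR(x_{it}u_{it})$$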
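The add-and-subtract step behind the three-part decomposition can be checked numerically on its sample analogue. The sketch below, which is only an illustration, takes an arbitrary array of "scores" $s_{it} = x_{it}u_{it}$ and compares a brute-force sum of $s_{it}s_{js}'$ over all pairs that share a unit or a period (the only pairs Assumption 2 allows to be correlated) with the sum of the unit-cluster part and the time-cluster part minus the diagonal part. The function names and the random input are assumptions made for the example, and the $1/NT$ scaling is omitted on both sides.

```python
import numpy as np


def three_parts(s):
    """Unit-cluster, time-cluster and diagonal parts for scores s of shape (N, T, k)."""
    by_unit = s.sum(axis=1)                 # (N, k): sum over t within unit i
    by_time = s.sum(axis=0)                 # (T, k): sum over i within period t
    flat = s.reshape(-1, s.shape[2])        # (N*T, k): one row per observation
    part_unit = by_unit.T @ by_unit         # sum_i (sum_t s_it)(sum_t s_it)'
    part_time = by_time.T @ by_time         # sum_t (sum_i s_it)(sum_i s_it)'
    part_diag = flat.T @ flat               # sum_i sum_t s_it s_it'
    return part_unit, part_time, part_diag


def brute_force(s):
    """Sum s_it s_jr' over all pairs of observations with i == j or t == r."""
    n_units, n_periods, k = s.shape
    total = np.zeros((k, k))
    for i in range(n_units):
        for t in range(n_periods):
            for j in range(n_units):
                for r in range(n_periods):
                    if i == j or t == r:
                        total += np.outer(s[i, t], s[j, r])
    return total


rng = np.random.default_rng(0)
s = rng.normal(size=(5, 4, 2))              # hypothetical scores, N = 5, T = 4, k = 2
part_unit, part_time, part_diag = three_parts(s)
combined = part_unit + part_time - part_diag
direct = brute_force(s)
print(np.allclose(combined, direct))
```

The diagonal part is subtracted because pairs with the same unit and the same period are counted once in the unit-cluster part and once again in the time-cluster part.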

Here, $\Omega_i = \frac{1}{T} E[U_iU_i' \mid X_i]$ and

$$E(X_i'\Omega_iX_i) = \frac{1}{T}\sum_{t=1}^{T} E\left(x_{it}x_{it}'u_{it}^2\right) + 2\,\frac{1}{T}\sum_{t=1}^{T-1}\sum_{s=t+1}^{T} E\left(u_{it}u_{is}x_{it}x_{is}'\right),$$

and $\Omega_t = \frac{1}{N} E[U_tU_t' \mid X_t]$ and

$$E(X_t'\Omega_tX_t) = \frac{1}{N}\sum_{i=1}^{N} E\left(x_{it}x_{it}'u_{it}^2\right) + 2\,\frac{1}{N}\sum_{i=1}^{N-1}\sum_{j=i+1}^{N} E\left(u_{it}u_{jt}x_{it}x_{jt}'\right).$$

Notice that when $N \to \infty$, if we do not have weak dependence in the cross section, $W$ will not be finite: the term $\lim_{N\to\infty} 2\,\frac{1}{N}\sum_{i=1}^{N-1}\sum_{j=i+1}^{N} E(u_{it}u_{jt}x_{it}x_{jt}')$ diverges, and we would not be able to define an asymptotic distribution for $\hat\beta$ unless we add some assumptions about the spatial dependence or change the formula for the asymptotic variance. Therefore, if we do not assume any mixing condition on the dependence, we have to divide $W$ by $N$ in order to obtain a finite variance-covariance matrix. The case when $T \to \infty$ and there is no mixing in the time series is exactly analogous: the term $\lim_{T\to\infty} 2\,\frac{1}{T}\sum_{t=1}^{T-1}\sum_{s=t+1}^{T} E(u_{it}u_{is}x_{it}x_{is}')$ diverges, and we need to divide $W$ by $T$.

Before analyzing estimators of $W$, I will discuss the properties of the OLS estimator of $\beta$ under different assumptions about the error dependence and different situations regarding the sample size of the cross section and the time series.

Remark 1: When $\{N, T\} \to \infty$ and there is mixing in both the spatial and the time dependence, the speed of convergence of the OLS estimator to its asymptotic distribution is $d_{NT} = \sqrt{NT}$.

Remark 2: When $\{N, T\} \to \infty$ and there is mixing in the cross section but not in the time series, the OLS estimator is $\sqrt{N}$-consistent rather than $\sqrt{NT}$-consistent. Intuitively, this slower rate of convergence arises because the time series data are less informative: we are allowing for a general form of time dependence and are learning only from the cross section.

Remark 3: When $\{N, T\} \to \infty$ and there is mixing in the time series but not in the cross section, the OLS estimator is $\sqrt{T}$-consistent rather than $\sqrt{NT}$-consistent. Intuitively, this slower rate of convergence arises because the cross-sectional data are less informative: we are allowing for a general form of spatial dependence and are learning only from the time series.

Remark 4: No consistency of $\hat\beta$. Even under Assumption 1, the OLS estimator is not consistent in the following scenarios:
(i) $N \to \infty$, $T$ is fixed and there is no mixing in the spatial dependence;
(ii) $T \to \infty$, $N$ is fixed and there is no mixing in the time series dependence;
(iii) $\{N, T\} \to \infty$ and there is no mixing in either the spatial or the time dependence.

In the following subsections, I will analyze the properties of different estimators of $W$ and $V$ when we have both spatial and time dependence.

3 Robust estimators of the variance of the OLS estimator under spatial and time dependence

The purpose of this section is to study the properties of the two-way cluster estimator of $V$ as a robust way of taking into account the spatial and time dependence in the error term, and to give some insight into its assumptions. For that purpose I extend the results in Hansen (2007), who analyzed the properties of the one-way cluster estimator in a panel with only time dependence in the errors, to the two-way cluster estimator in a framework with spatial and time dependence.

In order to obtain a consistent estimator of $V$, we need a consistent estimator of $W$. Once we have $\hat W$ we are done:

$$\hat V = \left(\frac{1}{NT}\sum_{i}^{N}\sum_{t}^{T} x_{it}x_{it}'\right)^{-1} \hat W \left(\frac{1}{NT}\sum_{i}^{N}\sum_{t}^{T} x_{it}x_{it}'\right)^{-1}$$

The idea is to get a consistent estimator of

$$W = \frac{1}{NT}\sum_{i=1}^{N} E(X_i'U_iU_i'X_i) + \frac{1}{NT}\sum_{t}^{T} E(X_t'U_tU_t'X_t) - \frac{1}{NT}\sum_{i}^{N}\sum_{t}^{T} E\left(x_{it}x_{it}'u_{it}^2\right).$$

Therefore, we need to consistently estimate each of the three parts of $W$.

3.1 Two way clustering estimator

The two-way clustering estimator proposed by Cameron et al. (2011) extends the one-way cluster estimator of Arellano (1987) to a setting with correlation in two dimensions. In fact, Cameron et al. (2011) propose estimating each of the first two parts of $W$ by a one-way cluster estimator. The first part of $W$ is estimated by grouping the data by individual clusters and the second part is estimated by grouping the data by time clusters. Notice that the third part of $W$ is the overall mean of the individual variances and can be consistently estimated using the White formula:
$$\frac{1}{NT}\sum_{i}^{N}\sum_{t}^{T} x_{it}x_{it}'\hat u_{it}^2$$

where $\hat u_{it}$ are the OLS residuals of (2.2).

Let me discuss the intuition behind this estimator. The first part of $W$ contains the time correlation of each individual, $\frac{1}{N}\sum_{i=1}^{N} E(X_i'\Omega_iX_i)$, where $\Omega_i$ is a $T \times T$ matrix that contains all the time dependence of individual $i$: $\Omega_i = \frac{1}{T} E[U_iU_i' \mid X_i]$. In the most general case of time dependence, each $\Omega_i$ contains $T(T+1)/2$ different elements. Nevertheless, it is important to remark that we do not need to consistently estimate each $\Omega_i$; we only need to consistently estimate an average over the individual clusters, $\frac{1}{N}\sum_{i=1}^{N} E(X_i'U_iU_i'X_i)$. This is the same problem that arises with heteroskedastic errors. Therefore we can extend the robust variance estimator proposed by White to the data grouped by individual: $\frac{1}{NT}\sum_{i=1}^{N} X_i'\hat U_i\hat U_i'X_i$. In fact, this is the one-way cluster estimator for panel data proposed by Arellano (1987) and studied by Hansen (2007).

In order to understand the conditions for consistency of this estimator we can write

$$X_i'U_iU_i'X_i = E(X_i'U_iU_i'X_i) + \varepsilon_i, \quad\text{where } E[\varepsilon_i] = 0.$$

Then

$$\frac{1}{N}\sum_{i=1}^{N} X_i'U_iU_i'X_i = \frac{1}{N}\sum_{i=1}^{N} E(X_i'U_iU_i'X_i) + \frac{1}{N}\sum_{i=1}^{N} \varepsilon_i.$$

Therefore $\frac{1}{N}\sum_{i=1}^{N} X_i'U_iU_i'X_i$ will be a consistent estimator of $\frac{1}{N}\sum_{i=1}^{N} E(X_i'U_iU_i'X_i)$ as long as $\frac{1}{N}\sum_{i=1}^{N} \varepsilon_i \xrightarrow{p} E[\varepsilon_i]$. This will happen only when $N \to \infty$. Given Assumption 1, we can replace the unobserved vector $U_i$ by the consistent estimator $\hat U_i$ in the one-way cluster formula.

Remark 5: The estimator $\frac{1}{NT}\sum_{i=1}^{N} X_i'\hat U_i\hat U_i'X_i$ is a function of the weighted average of all the individual-cluster variances; therefore, in order to derive the asymptotics we need the number of clusters $N$ to go to infinity.

Remark 6: When $\{N, T\} \to \infty$, the speed of convergence of $\frac{1}{NT}\sum_{i=1}^{N} X_i'\hat U_i\hat U_i'X_i$ to its asymptotic distribution is $\sqrt{N}$ (see Theorems 2 and 3 of Hansen (2007)).

The second part of $W$ contains the spatial correlation of the error at each moment in time, $\frac{1}{T}\sum_{t=1}^{T} E(X_t'\Omega_tX_t)$, where $\Omega_t$ is an $N \times N$ matrix that contains all the spatial dependence in period $t$. As in the previous case, we do not need a consistent estimator of each $\Omega_t$; we need a consistent estimator of $\frac{1}{T}\sum_{t=1}^{T} E(X_t'U_tU_t'X_t)$. Therefore we can group the data by time clusters and compute the following expression: $\frac{1}{T}\sum_{t}^{T} X_t'\hat U_t\hat U_t'X_t$. This expression will be a consistent estimator of $\frac{1}{T}\sum_{t}^{T} E(X_t'U_tU_t'X_t)$ as long as $\frac{1}{T}\sum_{t=1}^{T} \varepsilon_t \xrightarrow{p} E[\varepsilon_t]$. This will happen only when $T \to \infty$.

Remark 7: The estimator $\frac{1}{NT}\sum_{t}^{T} X_t'\hat U_t\hat U_t'X_t$ is a function of the weighted average of all the time-cluster variances; therefore, in order to derive the asymptotics we need the number of clusters $T$ to go to infinity.

Remark 8: When $\{N, T\} \to \infty$, the speed of convergence of $\frac{1}{NT}\sum_{t}^{T} X_t'\hat U_t\hat U_t'X_t$ to its asymptotic distribution is $\sqrt{T}$.

The two-way cluster estimator is the following:

$$\hat W_{2WCLUSTER} = \frac{1}{NT}\sum_{i=1}^{N} X_i'\hat U_i\hat U_i'X_i + \frac{1}{NT}\sum_{t}^{T} X_t'\hat U_t\hat U_t'X_t - \frac{1}{NT}\sum_{i}^{N}\sum_{t}^{T} x_{it}x_{it}'\hat u_{it}^2$$

where $\hat U_i$ is a $T \times 1$ vector that groups the $T$ residuals for individual $i$ and $\hat U_t$ is an $N \times 1$ vector that groups the $N$ residuals in period $t$.

Remark 9: The two-way cluster estimator requires both $N$ and $T$ to go to infinity, because the first and the second parts of $\hat W_{2WCLUSTER}$ are consistent when $N$ goes to infinity and when $T$ goes
to infinity, respectively.

Remark 10: The speed of convergence of $\hat W_{2WCLUSTER}$ is $\sqrt{\min(N, T)}$.

Therefore, in a panel model where $N \times T$ is large but $N$ and $T$ separately are not, as in state-year panels, the two-way cluster estimator of the variance of the OLS estimator will behave poorly, and inference based on this estimator and the normal approximation could be misleading. In Section 4, I study the behavior of the two-way clustering approach with a Monte Carlo exercise based on a state-year panel of wage inequality and the minimum wage.

It is true that the two-way clustering approach is the ideal way of estimating the variance, because it does not impose any restriction on the form of the dependence. Nevertheless, this freedom comes with the requirement of a large number of observations in both the cross section and the time series. In settings where $N$ and $T$ are not large enough, as in state-year or country-year panels, the gain from imposing some structure on the error dependence could be important. For example, as I emphasized in Section 2, the OLS estimator of the parameter of a panel model is inconsistent when there is spatial and time dependence in the error term and there is no mixing in either the cross section or the time series. Hence, it is necessary to impose a weak dependence assumption in at least one of the dimensions. Besides this requirement for the consistency of the OLS estimator, the mixing assumption gives us the possibility of considering alternative estimators of the variance, such as the double Heteroskedasticity and Autocorrelation Consistent (HAC) variance estimator. With a mixing assumption in the time series, we can use the HAC estimator of Newey and West to account for the time dependence, and with a mixing assumption in the cross section we can estimate a spatial HAC version as in Conley (1999). The double HAC estimator should have better small-sample properties than the two-way cluster estimator.

In a recent paper, Bester et al. (2011) proposed a test based on the cluster estimator when the number of clusters is small. Rather than using the normal approximation, they derive a limiting distribution for the t statistic, treating the cluster estimator as a random variable. Nevertheless, this approach requires the number of elements in each cluster to be large relative to the number of clusters.

A third alternative, which could be more convenient when $N$ and $T$ are of the same size and are
not large enough, is to impose more structure modeling the dependence in a parametric way The way we should model the dependence will depend on the specic application In particular, Hansen (2007) remarks that the bias of a simple parametric estimators is also typically smaller in the case where the parametric model is correct, making this approach preferable when the researcher is condent about the form of the error process In fact, the two way clustering estimator can be expressed in terms of a fully exible spatial and time autorregressive parametric model (FSTARM) Depending on the context, we can use less exible models nested on FSTARM in order to account for both type of dependence The spatial and time autorrregressive model of order one is an special case of the FSTARM, in which the spatial and time dependence can be summarized in a few number parameters (two in this case) Therefore the estimation of the variance covariance matrix of the OLS estimator only depends on the estimation of this few parameters Under correct specication of the model for the dependence of the error, the estimation of the variance of the OLS estimator will have better small sample properties than the two way clustering approach Obviously there is the trade o between fragility of the parametric model and bad behavior in small samples of the non parametric approach, but given that the totally non parametric approach can be expressed in terms of the fully parametric model, I want to emphasized that under some assumptions of the error dependence,specic to the context we are working on, we can improve the estimation of the variance of the OLS estimator without depart too much from the exible two way clustering In the next subsection I present and discuss a parametric model that take into account the spatial and time dependence and I show how to estimate the underlying parameters of a spatial and time autoregressive model of order one In addition, I propose a non linear least square 
estimation that is consistent and computationally ecient under the presence of xed and time eects in middle panels 32 Parametric Approach: 321 The SAR model An alternative approach to deal with the error dependence is to model it in a parametric way, 13

i.e., imposing some structure on the variance-covariance matrix. In a small-sample framework, this methodology behaves better than clustering, since the variance estimator will be $\sqrt{NT}$-consistent (if the model is stationary). ARMA models are well known and widely used to model time dependence. Models for spatial dependence, however, are less well known, because cross-sectional data, unlike time series data, have no natural ordering. Notwithstanding, interest and developments in spatial econometrics have been increasing over the last 10 years as a consequence of the growing use of cross-sectional data and panels with large N in applied econometrics. Models for specifying, estimating and testing spatial dependence have been developed in leading and pioneering works such as Anselin (1988, 1999), Baltagi et al. (2003), Baltagi et al. (2010), Kelejian and Prucha (1998, 1999), Lee and Yu (2007, 2008), etc. All of these papers are based on the spatial autoregressive (SAR) model proposed by Cliff and Ord (1973, 1981), whose specification is similar to the time series autoregressive model. However, there are several important differences between them. In this section I will only comment on three of the most important ones.

The spatial autoregressive model of order one, SAR(1), for the error process of equation ?? is the following:

$$u_{it} = \lambda \sum_{j=1}^{N} w_{ij} u_{jt} + \varepsilon_{it}$$

or, in matrix form:

$$U(t) = \lambda W_N U(t) + \varepsilon(t) \tag{3.1}$$

where $U(t)$ is an $N \times 1$ vector which contains the shocks $u_{it}$ of the $N$ units at time $t$, $\lambda$ is the spatial parameter, $\varepsilon(t)$ is an $N \times 1$ vector of innovations, and $W_N$ is an $N \times N$ matrix which is determined by the researcher before the estimation. Each element $w_{ij}$ allows for direct correlation between two different cross-sectional units.

First, because spatial units do not have an established order, there is no direct counterpart of the time lag in the spatial domain. Maintaining the idea that nearby realizations of a stochastic process are related, a spatial lag operator $w_{ij}$ is used to allow for correlation between units that are near in distance. In this way, the realization at a spatial unit is a function of a weighted average of other realizations of the same stochastic process at neighboring locations. The weights $w_{ij}$, or the weighting matrix $W_N$ which contains all of them, therefore play an important role in modeling spatial dependence. This weighting matrix has the following features:

(i) Each $w_{ij}$ is nonzero when units $i$ and $j$ are not too far apart, and tends to zero as the distance between the units increases. This assumption is crucial to obtain a finite variance-covariance matrix for the asymptotic distribution of the OLS estimator when N goes to infinity, like an ergodicity property for a time series process.

(ii) $W_N$ is a nonstochastic matrix that has to be defined before the estimation, and all the elements of its diagonal are equal to zero: $w_{ij} = 0$ if $i = j$. The first assumption avoids identification problems (it is impossible to estimate the $N \times N$ elements of $W_N$ in a cross-sectional setting). The second assumption is a normalization of the model and has no implications for the estimation.

(iii) The matrix $I_N - \lambda W_N$ is nonsingular for all values of $\lambda$, which ensures that the process $U$ is invertible and can be expressed as a weighted average of the innovations $\varepsilon(t)$.

(iv) In most cases the matrix $W_N$ is row standardized in such a way that the sum of each row (column) is equal to one. This is only for ease of interpretation.

Second, unlike in time series models, the fact that $\lambda$ is less than one in absolute value does not imply covariance stationarity:
$$\mathrm{Var}(U) = E[UU'] = \sigma^2 \left[(I_N - \lambda W_N)'(I_N - \lambda W_N)\right]^{-1} \tag{3.2}$$

The diagonal elements of $\mathrm{Var}(U)$ in (3.2) are not necessarily equal and depend on the structure of $W_N$. Moreover, a $\lambda$ less than one in absolute value is required to ensure the ergodicity of the process, which implies a finite variance of $U$, $\hat\lambda$ and $\hat\beta$.

The third and most important difference between the SAR(1) model in (3.1) and the AR(1) model is that the OLS estimator of the autoregressive coefficient in a spatial model is inconsistent. Unlike in time series models, the lag term in a spatial model is an endogenous variable which is correlated with the innovations in $\varepsilon(t)$, even when the innovations are i.i.d. We can see this in the reduced form $U(t) = (I_N - \lambda W_N)^{-1}\varepsilon(t)$, where each shock in the vector $U(t)$ is a combination of all the innovations in the vector $\varepsilon(t)$. Therefore, the endogeneity and the bias of OLS depend on the structure of $W_N$ and on the value of the spatial autoregressive parameter $\lambda$. An alternative way to understand this problem is to think of a spatial model as a system of N simultaneous equations, so that OLS is biased and inconsistent due to simultaneity bias. As a consequence, the spatial lag variable has to be treated as an endogenous variable, and the estimation method used for the spatial lag parameter has to exploit valid moment conditions.

One way to estimate the spatial autoregressive coefficient is via pseudo maximum likelihood, assuming that $\varepsilon(t)$ is normally distributed. In this case we can treat $U(t)$ as a multivariate random vector with unconditional normal distribution $N\left(0, \sigma^2[(I_N - \lambda W_N)'(I_N - \lambda W_N)]^{-1}\right)$ and log-likelihood

$$\ln L = -\frac{N}{2}\left(\ln(2\pi) + \ln(\sigma^2)\right) + \ln\left|I_N - \lambda W_N\right| - \frac{1}{2\sigma^2}\, U'(I_N - \lambda W_N)'(I_N - \lambda W_N)\,U$$

3.2.2 A Parametric Model for Spatial and Time Dependence

In this subsection I present a model which allows for all kinds of dependence in a panel data model.
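As a numerical aside on the SAR(1) covariance in (3.2), the following minimal sketch shows that the diagonal of the covariance matrix is not constant even with $|\lambda| < 1$. The five-unit "line" of regions, the value $\lambda = 0.4$ and the function names are illustrative assumptions, not part of the original analysis.

```python
import numpy as np

def row_normalize(S):
    """Divide each row of a nonnegative matrix by its row sum."""
    return S / S.sum(axis=1, keepdims=True)

def sar_covariance(W, lam, sigma2=1.0):
    """Var(U) = sigma2 * [(I - lam W)'(I - lam W)]^{-1} for U = lam W U + eps."""
    B = np.eye(W.shape[0]) - lam * W
    return sigma2 * np.linalg.inv(B.T @ B)

# Hypothetical layout: 5 regions on a line, region i borders i-1 and i+1.
N = 5
S = np.zeros((N, N))
for i in range(N - 1):
    S[i, i + 1] = S[i + 1, i] = 1.0
W = row_normalize(S)

V = sar_covariance(W, lam=0.4)
variances = np.diag(V)  # unit-specific variances; they differ across units
print(np.round(variances, 3))
```

In this layout the end regions have one neighbor and the interior regions have two, which is enough to make the unit variances differ, so the process is heteroskedastic across space although the innovations are homoskedastic.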

This model is a generalization of the SAR model described above, with two additional regressors that control for the time dependence and for the spatial dependence at different moments in time:

$$U_t = \lambda (I_T \otimes W_N) U_t + \rho_1 U_{t-1} + \rho_2 (I_T \otimes W_N) U_{t-1} + \varepsilon_t \tag{3.3}$$

where $U_t$ is an $NT \times 1$ vector which contains each $u_{it}$ for the N states and T periods, and $U_{t-1}$ is the same vector lagged one period. In this model, the presence of the term $\lambda(I_T \otimes W_N)U_t$ in (3.3) generates, as in the SAR model, an endogeneity problem, so the OLS estimators remain inconsistent. To estimate the unknown parameters of this model, $\theta = [\lambda, \rho_1, \rho_2, \sigma^2]$, we can use a conditional pseudo maximum likelihood approach, as if it were the reduced form of a VAR model.

3.2.3 Quasi Maximum Likelihood

Defining $B_N = I_N - \lambda W_N$, $A = \rho_1 I_N + \rho_2 W_N$ and $A_1 = B_N^{-1} A$, and assuming that $B_N$ is invertible, we can express (3.3) in reduced form:

$$U_t = (I_T \otimes A_1) U_{t-1} + (I_T \otimes B_N^{-1}) \varepsilon_t \tag{3.4}$$

We know the first two moments of the conditional distribution of $U_t \mid U_{t-1}$:

$$E[U_t \mid U_{t-1}] = (I_T \otimes A_1) U_{t-1}$$

$$\mathrm{Var}[U_t \mid U_{t-1}] = \sigma^2 \left(I_T \otimes (B_N' B_N)^{-1}\right)$$

Thus, in this case the conditional expectation of the score is zero, and the consistency and asymptotic normality of the estimator are satisfied, as shown in Yu, de Jong and Lee (2007):

$$\ln L(U(2), \ldots, U(T); \theta) = -\frac{NT}{2}\left(\ln(2\pi) + \ln(\sigma^2)\right) + (T-1)\ln|B_N| - \frac{1}{2\sigma^2}\left[U_t - (I_{T-1} \otimes A_1)U_{t-1}\right]'\left(I_{T-1} \otimes (B_N' B_N)^{-1}\right)^{-1}\left[U_t - (I_{T-1} \otimes A_1)U_{t-1}\right]$$

where $U_t$ is an $N(T-1) \times 1$ vector and $I_{T-1}$ is an identity matrix.

3.2.4 Non-stationary case

One particularity of this spatial-time model is that, even when the autoregressive coefficient $\rho_1$ is less than one in absolute value, the vector $U(t)$ could contain unit roots, which implies that there may be nonstationary components in the data generating process. A nonstationary case occurs when some of the eigenvalues of $A_1$ are equal to one. As in Yu, de Jong and Lee (2007), we can write the $i$-th eigenvalue of $A_1$ as

$$\psi_{iN} = \frac{\rho_1 + \rho_2 \varpi_{iN}}{1 - \lambda \varpi_{iN}}$$

where $\varpi_{iN}$ is the $i$-th eigenvalue of the matrix $W_N$. Moreover, if $W_N$ is row normalized from a symmetric matrix, all its eigenvalues are real and less than or equal to one in absolute value, with the largest equal to one (see Ord, 1975), so we could have unit roots in $U(t)$ if $\lambda + \rho_1 + \rho_2 = 1$. Nevertheless, if the spatial weight matrix $W_N$ is row normalized from a symmetric matrix, we can still obtain the consistency and asymptotic normality of the ML and QML estimators with the same rate of convergence as in the stationary case, as demonstrated in Yu, de Jong and Lee (2007), although with a different variance-covariance matrix.

3.2.5 Monte Carlo Simulations: OLS vs QML

In this subsection I simulate a spatial process $U(t) = \lambda W_N U(t) + \varepsilon(t)$ with $N = 50$ for different values of the spatial parameter $\lambda$. The purpose of the simulation is to evaluate the performance of OLS relative to the conditional QML for a spatial model. Then I extend the simulation to a spatial-time model as in (3.3) for two different sets of values of the parameters in $\theta$.

I generate one thousand replications of a model with sample size $N = 50$ and $T = 31$, where the $\varepsilon(t)$ are generated from independent standard normal distributions. The initial value $U(1)$ is generated as $N(0, I_N)$. For each set of generated samples I calculate the simulated sample mean and the simulated sample variance of the OLS and the conditional QML estimators. The sample mean is constructed as $\bar\theta = \frac{1}{1000}\sum_{s=1}^{1000}\hat\theta_s$ and the sample variance as $\frac{1}{1000}\sum_{s=1}^{1000}\left(\hat\theta_s - \bar\theta\right)^2$. The results are summarized in Graph 1, Graph 2 and Table N1.
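The design above can be sketched for the pure spatial case as follows. This is a simplified illustration under stated assumptions, not the code behind Graph 1, Graph 2 or Table N1: the weight matrix is a hypothetical ring in which every unit has two neighbors, the cross-sections are drawn independently over time, the QML step maximizes the concentrated Gaussian log-likelihood over a grid, and only 100 replications are used to keep the run short. The numbers it produces therefore do not correspond to those in the paper.

```python
import numpy as np

def ring_weights(n):
    """Row-normalized W for a hypothetical ring: each unit has two neighbors."""
    W = np.zeros((n, n))
    for i in range(n):
        W[i, (i - 1) % n] = W[i, (i + 1) % n] = 0.5
    return W

def simulate_sar(W, lam, T, rng):
    """Draw T independent cross-sections from U(t) = (I - lam W)^{-1} eps(t)."""
    n = W.shape[0]
    eps = rng.standard_normal((n, T))
    return np.linalg.solve(np.eye(n) - lam * W, eps)

def ols_lambda(U, W):
    """Pooled OLS of u_it on its spatial lag (the regressor is endogenous)."""
    WU = W @ U
    return np.sum(WU * U) / np.sum(WU * WU)

def qml_lambda(U, W, grid, logdets):
    """Grid maximizer of T*ln|I - lam W| - (NT/2)*ln(sigma2_hat(lam))."""
    n, T = U.shape
    WU = W @ U
    s_uu, s_uw, s_ww = np.sum(U * U), np.sum(U * WU), np.sum(WU * WU)
    sigma2 = (s_uu - 2 * grid * s_uw + grid**2 * s_ww) / (n * T)
    loglik = T * logdets - 0.5 * n * T * np.log(sigma2)
    return grid[np.argmax(loglik)]

def monte_carlo(lam=0.4, n=50, T=31, reps=100, seed=0):
    rng = np.random.default_rng(seed)
    W = ring_weights(n)
    grid = np.linspace(-0.9, 0.9, 181)
    # The log-determinant does not depend on the data, so compute it once.
    logdets = np.array([np.linalg.slogdet(np.eye(n) - l * W)[1] for l in grid])
    draws = []
    for _ in range(reps):
        U = simulate_sar(W, lam, T, rng)
        draws.append([ols_lambda(U, W), qml_lambda(U, W, grid, logdets)])
    draws = np.array(draws)
    return draws.mean(axis=0), draws.var(axis=0)

means, variances = monte_carlo()
print("mean of OLS and QML estimates:", means)
```

The first entry of `means` is the average OLS estimate and the second the average QML estimate; comparing both with the true value passed as `lam` reproduces the qualitative comparison discussed in this subsection.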

Graphic N1: $U(t) = \lambda W_N U(t) + \varepsilon(t)$

The black line in the graph above shows the true values of the spatial parameter, while the blue line shows the average of the OLS estimator over a thousand replications. The red dashed lines are the confidence intervals, calculated as two standard deviations around the mean. From the graph we can see that the bias of the OLS estimator increases with the value of the spatial parameter.

Graphic N2: $U(t) = \lambda W_N U(t) + \varepsilon(t)$

On the other hand, the mean of the conditional QML estimator is very close to the actual values of the spatial parameter. In addition, the actual values of the spatial parameter lie within the confidence interval of the estimator.

Table N1: $U_t = \lambda (I_T \otimes W_N) U_t + \rho_1 U_{t-1} + \rho_2 (I_T \otimes W_N) U_{t-1} + \varepsilon_t$

From Table N1 we can observe that the mean of the OLS estimator is far from the true value of the parameters representing the spatial dependence. The OLS estimator has a bias of 0.11 (two standard deviations) in the spatial parameter, compared with a bias of 0.02 (one standard deviation) for the conditional QML. This difference in the biases increases when the true value of the spatial parameter is higher: when the true value of the spatial parameter is 0.6 rather than 0.4, the bias of the OLS estimator is 0.21 (7 standard deviations) against 0.02 for the QML estimator (2 standard deviations). The average of the OLS estimator of the parameter that represents the time dependence is very close to the true parameter, as is that of the conditional QML estimator, since the time lag does not suffer from endogeneity. Finally, the OLS estimator of the parameter that measures the dependence between states at different moments in time is also biased, because the variable $W_N U_{t-1}$ is strongly correlated with the endogenous variable $W_N U_t$. Thus, the stronger the endogeneity of $W_N U_t$ (which depends on the structure of $W_N$ and the value of $\lambda$), the larger the bias of the OLS estimator of the parameter on $W_N U_{t-1}$, as shown in the table.

4 Monte Carlo Simulation

The purpose of this section is twofold. The first is to study the behavior of the two-way clustering approach and the parametric approach in a setup where we have spatial and time dependence in the error term of the model and NT is large but N and T separately are not. The second is to study and model the error dependence in a specific application and to analyze how the results and conclusions could change once we control for several forms of dependence. In order to tackle these two purposes, I will focus on a recent paper that studies the causal effect of the minimum wage on US wage inequality, written by Autor, Manning and Smith (henceforth AMS). This research uses a state panel data set with $N = 50$ and $T = 30$. By using OLS and 2SLS estimators, with fixed and time effects as controls, the authors conclude that the impact of the minimum wage on wage inequality in the period 1979-2009 is highly significant. I argue that their inferences are not so reliable, since they do not control for error dependence in a setting where it is very likely. In fact, Barrios et al. (2010) show that yearly earnings have substantial correlation across states, which
decreases when distance increases. This dependence could be explained by geographical or local labor market features.

In subsection 4.1, I briefly explain the AMS model. In subsection 4.2, I discuss how to estimate a spatial and time autoregressive model for the error term of the AMS model, I propose a nonlinear least squares estimation that is consistent and computationally efficient in the presence of fixed and time effects in medium-sized panels, and I estimate the parameters of the spatial and time autoregressive model for the error term of AMS. In subsection 4.3, I generate replications of data with the same characteristics as the AMS error model in order to evaluate how often we make type I errors in a regression of a variable with the AMS error structure on the regressors of the AMS model (the effective minimum wage and its square), for different estimators of the variance-covariance matrix.

4.1 Autor, Manning and Smith (AMS)

In their paper, AMS use a panel of 30 years and 50 states to measure the impact of the minimum wage on US wage inequality. Their conclusions are based on the estimation of the following regression model:

$$\log wage_{it}(10) - \log wage_{it}(50) = \left[\log wage^{*}_{it}(10) - \log wage^{*}_{it}(50)\right] + \beta_1\left[\log wage^{Min}_{it} - \log wage_{it}(50)\right] + \beta_2\left[\log wage^{Min}_{it} - \log wage_{it}(50)\right]^2 + u_{it} \tag{4.1}$$

where $\log wage_{it}(10) - \log wage_{it}(50)$ is the log of the 10th percentile state wage relative to the log of the median state wage (a proxy for wage inequality); the starred term $\log wage^{*}_{it}(10) - \log wage^{*}_{it}(50)$ is the latent inequality, which is approximated by the sum of a fixed effect and a time effect (see Autor, Manning and Smith, 2010); and $[\log wage^{Min}_{it} - \log wage_{it}(50)]$ is a measure of the bindingness of the minimum wage for state $i$ in year $t$, henceforth the effective minimum wage. Therefore, the estimated marginal effect of the minimum wage on wage inequality is given by the estimate of

$$\beta_1 + 2\beta_2 \frac{1}{NT}\sum_{i=1}^{N}\sum_{t=1}^{T}\left[\log wage^{Min}_{it} - \log wage_{it}(50)\right]$$

AMS remark that the estimation of
$\beta_1$ and $\beta_2$ in (4.1) could be affected by division bias, because the variable $\log wage_{it}(50)$ appears on both sides of the regression, which induces an artificial positive correlation caused by sampling variation. For this reason, they use the maximum of the federal regulatory minimum wage and the state's minimum wage, henceforth the statutory minimum, as an instrument. The idea is that the federal regulatory wage could directly affect $\log wage^{Min}_{it}$ but not $\log wage_{it}(50)$. It is reasonable to think that the federal regulatory wage is an exogenous source of variation that is not directly correlated with the state variation in wage inequality. To instrument the square of the effective minimum, they use the square of the predicted value from a regression of the effective minimum on the statutory minimum, using year and state dummies as controls.

In the AMS model there may be some shocks that we are not controlling for (apart from the fixed and time effects). For example, we could think of a productivity shock that affects inequality in each state and in each period. As in Beaudry et al. (2010), technological shocks could increase the return to high-skill workers relative to low-skill workers, increasing inequality in a specific state. These productivity shocks could create error dependence in the model in three different dimensions: (i) as in macro models, productivity shocks could follow an autoregressive process, inducing time dependence in the error term; (ii) state-specific productivity shocks could propagate to other states through linkages across states, for example, quickly between neighboring states; and (iii) productivity shocks could generate spillovers between states at different points in time, i.e., the new technology is transmitted from one state to another over time.

4.2 Estimation of a Parametric Model for Spatial and Time Dependence in the residual of AMS's model

As a first step, I estimated AMS's model in (4.1) by OLS and 2SLS, controlling for time and state effects, in order to obtain the residuals. As in Autor et al. (2010), in order to build the dependent variable I followed these steps: (i) First, I pooled all individual responses from the Current Population Survey Merged Outgoing Rotation Group (CPS MORG) for each year. (ii) I used the
reported hourly wage for those who reported being paid by the hour and, if the individual did not report an hourly wage, I calculated this variable as weekly earnings divided by hours worked in the prior week. (iii) Then, I limited the sample to individuals older than 18 and younger than 64, and excluded self-employed individuals. (iv) Finally, in order to reduce the influence of outliers, I replaced the 98th and 99th percentiles of the wage distribution in each state and year by the 97th percentile value. Using these individual wage data, I computed the percentiles of the state wage distributions by sex for 1979-2009, weighting individual observations by their CPS sampling weights. For the minimum wage and the federal wage I used the data reported in Table 1 of the Appendix of Autor et al. (2010).

The estimated residual of the model is given by the following expression:

$$\hat u_{it} = \log wage_{it}(10) - \log wage_{it}(50) - \hat\delta_t - \hat\delta_i - \hat\beta_1\left[\log wage^{Min}_{it} - \log wage_{it}(50)\right] - \hat\beta_2\left[\log wage^{Min}_{it} - \log wage_{it}(50)\right]^2 \tag{4.2}$$

Using these residuals I estimated a spatial-time model as in (3.3) in order to get an estimate of the parameters which characterize the error dependence of the AMS model. $W_N$ is obtained by row-normalizing a symmetric matrix whose $(i,j)$ element takes the value one if states $i$ and $j$ share a border and zero otherwise. The parameters in $\theta = [\lambda, \rho_1, \rho_2, \sigma^2]$ of the proposed model for the residuals, estimated both by OLS and by conditional QML (the residuals come from the OLS and 2SLS estimates of the AMS model), are summarized in Table 2 and Table 3.

Table N2

Table N3

As we can see from Table N2 and Table N3, the OLS and conditional QML estimators of the three parameters are highly significant, with a spatial parameter near 0.2 (for the QML estimator). Moreover, the R-squared of the regressions is around 0.5, and the likelihood ratio test for the three forms of autocorrelation rejects the null hypothesis of no autocorrelation.

In this setting $NT$ is large, but N and T separately are not, so the residuals estimated from the AMS model could be biased, since they depend on the estimates of the fixed effects and the time effects, which are biased for small T and small N respectively. To avoid the bias that the estimation of the fixed and time effects induces in the residual regression, I eliminate these effects by expressing the variables of the AMS model in deviations from their time and state means. That is, I transform the data with $Q_1$ and $Q_2$ in order to eliminate the time and individual effects from the regression, where $Q_1 = I_N \otimes \left(I_T - \frac{1}{T}\iota_T\iota_T'\right)$ is the matrix which deviates the variables from their time means, and $Q_2 = \left(I_N - \frac{1}{N}\iota_N\iota_N'\right) \otimes I_T$ is the matrix which deviates the variables from their state means. Therefore, I should estimate a spatial-time model for the new transformed residuals:

$$\tilde U_t = \lambda [I_T \otimes W_N]\tilde U_t + \rho_1 \tilde U_{t-1} + \rho_2 [I_T \otimes W_N]\tilde U_{t-1} + \tilde\varepsilon_t \tag{4.3}$$
where $\tilde U = Q_2Q_1U$ and $\tilde\varepsilon = Q_2Q_1\varepsilon$. However, this transformation creates other sources of bias in the model, because $E[\tilde\varepsilon_t \mid \tilde U_{t-1}]$ is not zero. Therefore, the moments used in the conditional QML for the reduced form of the model are no longer valid. One way to estimate this model is to use the unconditional likelihood of $\tilde U$:

$$\ln L = -\frac{NT}{2}\left(\ln(2\pi) + \ln(\sigma^2)\right) + \ln\left|\left(Q_2Q_1\Upsilon Q_1'Q_2'\right)^{-1/2}\right| - \frac{1}{2\sigma^2}\,\tilde U'\left(Q_2Q_1\Upsilon Q_1'Q_2'\right)^{-1}\tilde U$$

where $\Upsilon$ is the variance-covariance matrix of the $NT \times 1$ vector $U$:

$$\Upsilon = E\begin{bmatrix}
u_{1t}u_{1t} & \cdots & u_{1t}u_{Nt} & \cdots & u_{1t}u_{1T} & \cdots & u_{1t}u_{NT}\\
\vdots & \ddots & \vdots & & \vdots & \ddots & \vdots\\
u_{Nt}u_{1t} & \cdots & u_{Nt}u_{Nt} & \cdots & u_{Nt}u_{1T} & \cdots & u_{Nt}u_{NT}\\
\vdots & & \vdots & \ddots & \vdots & & \vdots\\
u_{1T}u_{1t} & \cdots & u_{1T}u_{Nt} & \cdots & u_{1T}u_{1T} & \cdots & u_{1T}u_{NT}\\
\vdots & \ddots & \vdots & & \vdots & \ddots & \vdots\\
u_{NT}u_{1t} & \cdots & u_{NT}u_{Nt} & \cdots & u_{NT}u_{1T} & \cdots & u_{NT}u_{NT}
\end{bmatrix}$$

A practical difficulty with this unconditional QML is that the estimation of the parameters in $\theta = [\lambda, \rho_1, \rho_2, \sigma^2]$ is computationally complex, because evaluating this likelihood involves repeated evaluation of the determinant of the $NT \times NT$ matrix $\Upsilon$. For this reason, I moved to another methodology which exploits valid moment conditions but is computationally easy.

4.2.1 Nonlinear Least Squares

Another way to estimate the parameters in $\theta$ without imposing any distributional assumption on the vector $\varepsilon(t)$ is to use all the moment conditions inside $\Upsilon = E(UU')$.
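The two demeaning matrices $Q_1$ and $Q_2$ used in the transformation can be checked numerically with a small sketch. It assumes that the $NT$ vector is stacked unit by unit (all periods of state 1, then all periods of state 2, and so on), which is the ordering implied by the Kronecker structure of $Q_1$ and $Q_2$; the dimensions and the simulated effects are hypothetical.

```python
import numpy as np

def within_transformations(N, T):
    """Q1 removes each unit's mean over time; Q2 removes each period's mean over units.
    Assumes unit-major stacking: (u_11, ..., u_1T, u_21, ..., u_2T, ...)."""
    M_T = np.eye(T) - np.ones((T, T)) / T
    M_N = np.eye(N) - np.ones((N, N)) / N
    Q1 = np.kron(np.eye(N), M_T)
    Q2 = np.kron(M_N, np.eye(T))
    return Q1, Q2

# Hypothetical small panel with additive state and time effects.
N, T = 4, 3
rng = np.random.default_rng(0)
alpha = rng.normal(size=N)            # state effects
delta = rng.normal(size=T)            # time effects
e = rng.normal(size=(N, T))           # idiosyncratic part
u = alpha[:, None] + delta[None, :] + e

Q1, Q2 = within_transformations(N, T)
u_tilde = (Q2 @ Q1 @ u.reshape(-1)).reshape(N, T)
e_tilde = (Q2 @ Q1 @ e.reshape(-1)).reshape(N, T)

# Both effects are wiped out: transforming u or e gives the same result.
print(np.allclose(u_tilde, e_tilde))
```

Because $Q_2Q_1$ is a projection, it is singular, which is one reason the transformed innovations are no longer independent across units and periods.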

For simplicity, I define the vector $U(t)$ as the vector of shocks for all the states at time $t$. Therefore, for each time $t$ the parametric model for the error term of AMS is the following:

$$U(t) = \lambda W_N U(t) + \rho_1 U(t-1) + \rho_2 W_N U(t-1) + \varepsilon(t)$$

or, in reduced form:

$$U(t) = A_1 U(t-1) + B_N^{-1}\varepsilon(t) \tag{4.4}$$

$$\Upsilon = E\begin{bmatrix}
U(t)U(t)' & U(t)U(t+1)' & \cdots & U(t)U(T)'\\
U(t+1)U(t)' & U(t+1)U(t+1)' & \cdots & U(t+1)U(T)'\\
\vdots & \vdots & \ddots & \vdots\\
U(T)U(t)' & U(T)U(t+1)' & \cdots & U(T)U(T)'
\end{bmatrix}$$

This $\Upsilon$ can be expressed as a function of the parameters in $\theta$. $A_1$ is not necessarily symmetric and, in particular, need not have all its eigenvalues less than one, even when $u_{it}$ is time-stationary. Therefore each submatrix of $\Upsilon$ could depend on the difference in time. We can express (4.4) in its MA form using the lag operator:

$$U(t) = B_N^{-1}\varepsilon(t) + A_1 B_N^{-1}\varepsilon(t-1) + A_1^2 B_N^{-1}\varepsilon(t-2) + A_1^3 B_N^{-1}\varepsilon(t-3) + \cdots$$

In this way, with the index $t = 1, 2, 3, \ldots, T$:

$$\begin{aligned}
E[U(1)U(1)'] &= (B_N'B_N)^{-1}\sigma^2 I_N, &
E[U(1)U(2)'] &= (B_N'B_N)^{-1}\sigma^2 A_1', \\
E[U(1)U(3)'] &= (B_N'B_N)^{-1}\sigma^2 (A_1^2)', \;\ldots, &
E[U(1)U(T)'] &= (B_N'B_N)^{-1}\sigma^2 (A_1^{T-1})'
\end{aligned}$$

$$\begin{aligned}
E[U(2)U(1)'] &= (B_N'B_N)^{-1}\sigma^2 A_1, &
E[U(2)U(2)'] &= (B_N'B_N)^{-1}\sigma^2 \left[I_N + A_1A_1'\right], \\
E[U(2)U(3)'] &= (B_N'B_N)^{-1}\sigma^2 \left[A_1' + A_1(A_1^2)'\right], \;\ldots, &
E[U(2)U(T)'] &= (B_N'B_N)^{-1}\sigma^2 \left[(A_1^{T-2})' + A_1(A_1^{T-1})'\right]
\end{aligned}$$

$$\begin{aligned}
E[U(3)U(1)'] &= (B_N'B_N)^{-1}\sigma^2 A_1^2, &
E[U(3)U(2)'] &= (B_N'B_N)^{-1}\sigma^2 \left[A_1 + A_1^2A_1'\right], \\
E[U(3)U(3)'] &= (B_N'B_N)^{-1}\sigma^2 \left[I_N + A_1A_1' + A_1^2(A_1^2)'\right], \;\ldots, &
E[U(3)U(T)'] &= (B_N'B_N)^{-1}\sigma^2 \left[(A_1^{T-3})' + A_1(A_1^{T-2})' + A_1^2(A_1^{T-1})'\right]
\end{aligned}$$

$$\begin{aligned}
E[U(T)U(1)'] &= (B_N'B_N)^{-1}\sigma^2 A_1^{T-1}, &
E[U(T)U(2)'] &= (B_N'B_N)^{-1}\sigma^2 \left[A_1^{T-2} + A_1^{T-1}A_1'\right], \\
E[U(T)U(3)'] &= (B_N'B_N)^{-1}\sigma^2 \left[A_1^{T-3} + A_1^{T-2}A_1' + A_1^{T-1}(A_1^2)'\right], \;\ldots, &
E[U(T)U(T)'] &= (B_N'B_N)^{-1}\sigma^2 \left[I_N + A_1A_1' + \cdots + A_1^{T-1}(A_1^{T-1})'\right]
\end{aligned}$$

Using the residuals from the transformed AMS model, we can minimize the quadratic distance between all the moment conditions inside $Q_2Q_1\Upsilon(\theta)Q_1'Q_2'$ and the sample counterpart of $E[\tilde U\tilde U']$. Defining $\zeta_p = vech(\hat{\tilde U}\hat{\tilde U}')$ and $\xi_p(\theta) = vech(Q_2Q_1\Upsilon(\theta)Q_1'Q_2')$, we can estimate $\theta$ by nonlinear least squares from the following model:

$$\zeta_p = \xi_p(\theta) + \upsilon_p$$

where $p = \frac{NT(NT+1)}{2}$, and

$$\hat\theta_{NLS} = \arg\min_{\theta}\left\{[\zeta_p - \xi_p(\theta)]'[\zeta_p - \xi_p(\theta)]\right\}$$

This estimator is consistent but inefficient, because we are not using the correct weighting matrix for the moments in $\Upsilon(\theta)$. (For simplicity I use the identity matrix as the weighting matrix instead of $\mathrm{Var}(\zeta_p)$; to use the correct matrix I would need to know the fourth moments of $u_{it}$.) To compute standard errors for $\hat\theta_{NLS}$, I used a Monte Carlo simulation generating replications of the following model:

$$\tilde U_t = \hat\lambda_{NLS}[I_T \otimes W_N]\tilde U_t + \hat\rho_{1,NLS}\tilde U_{t-1} + \hat\rho_{2,NLS}[I_T \otimes W_N]\tilde U_{t-1} + \tilde\varepsilon_t$$

For each simulation I generate an $NT \times 1$ vector $\tilde\varepsilon$ from a normal distribution $N(0, \hat\sigma^2_{NLS}Q_2Q_1Q_1'Q_2')$. The results of the OLS and NLS estimations of the spatial-time model for the OLS transformed residuals and the 2SLS transformed residuals are summarized in Table 4 and Table 5.
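The moment-matching idea behind the NLS estimator can be sketched as below. This is a hedged illustration, not the estimation code used for Table 4 and Table 5: it builds $\Upsilon(\theta)$ by a covariance recursion derived from the reduced form (4.4) under the assumption $U(0) = 0$, stacks the vector period by period, and evaluates the unweighted quadratic distance. The ring weight matrix, the dimensions, the parameter values and the function names are all hypothetical, and the optional projection `Q` must be expressed in the same period-by-period ordering.

```python
import numpy as np

def stacked_covariance(theta, W, T):
    """Upsilon(theta) = E[U U'] for U = (U(1)', ..., U(T)')', assuming U(0) = 0."""
    lam, rho1, rho2, sigma2 = theta
    N = W.shape[0]
    B = np.eye(N) - lam * W
    A1 = np.linalg.solve(B, rho1 * np.eye(N) + rho2 * W)
    Omega = sigma2 * np.linalg.inv(B.T @ B)      # Var of B^{-1} eps(t)
    V = [Omega]
    for _ in range(T - 1):
        V.append(A1 @ V[-1] @ A1.T + Omega)      # Var U(t) = A1 Var U(t-1) A1' + Omega
    Ups = np.zeros((N * T, N * T))
    for s in range(T):
        C = V[s]                                  # E[U(s) U(s)']
        for t in range(s, T):
            Ups[s*N:(s+1)*N, t*N:(t+1)*N] = C
            Ups[t*N:(t+1)*N, s*N:(s+1)*N] = C.T
            C = C @ A1.T                          # E[U(s) U(t+1)'] = E[U(s) U(t)'] A1'
    return Ups

def vech(M):
    """Stack the lower triangle (including the diagonal) of a square matrix."""
    return M[np.tril_indices(M.shape[0])]

def nls_objective(theta, zeta, W, T, Q=None):
    """Unweighted distance between sample moments zeta and model moments."""
    Ups = stacked_covariance(theta, W, T)
    if Q is not None:
        Ups = Q @ Ups @ Q.T
    d = zeta - vech(Ups)
    return float(d @ d)

# Hypothetical example: 4 units on a ring, 3 periods.
N, T = 4, 3
W = np.zeros((N, N))
for i in range(N):
    W[i, (i - 1) % N] = W[i, (i + 1) % N] = 0.5
theta0 = (0.4, 0.3, 0.1, 1.0)
zeta = vech(stacked_covariance(theta0, W, T))    # population moments at theta0
print(nls_objective(theta0, zeta, W, T), nls_objective((0.2, 0.3, 0.1, 1.0), zeta, W, T))
```

In practice `zeta` would be built from the transformed residuals and the objective would be passed to a numerical optimizer; here it is only evaluated at two parameter values to show that the distance vanishes at the parameters that generated the moments.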

Table N4

Table N5

The results do not change drastically: all the estimates of the three parameters are still highly significant, but now the estimate of the spatial parameter is higher (around 0.4) and the estimate of the lagged spatial parameter is lower (around 0.1), compared to the conditional QML estimates from the residuals without deviations.

4.3 Simulating AMS model with error dependence

Using these last estimated values of the parameters that characterize the dependence, I simulate one thousand replications of shocks with time and spatial dependence:

$$\tilde U_t = \hat\lambda_{NLS}[I_T \otimes W_N]\tilde U_t + \hat\rho_{1,NLS}\tilde U_{t-1} + \hat\rho_{2,NLS}[I_T \otimes W_N]\tilde U_{t-1} + \tilde\varepsilon_t$$

For each simulation, the $NT \times 1$ vector $\tilde\varepsilon$ was drawn from a normal distribution $N(0, \hat\sigma^2_{NLS}Q_2Q_1Q_1'Q_2')$. Then, using the actual regressors from the AMS model (the effective minimum wage and its square), I simulate the dependent variable of each replication:
$$y_J = \beta_1 Q_2Q_1\left[\log wage^{Min} - \log wage(50)\right] + \beta_2 Q_2Q_1\left[\log wage^{Min} - \log wage(50)\right]^2 + U_J, \qquad J = 1, \ldots, 1000$$

The values of $\beta_1$ and $\beta_2$ come from the estimation of the AMS model. Finally, for each replication I compute the OLS estimators of a regression of $y_J$ on $Q_2Q_1[\log wage^{Min} - \log wage(50)]$ and $Q_2Q_1[\log wage^{Min} - \log wage(50)]^2$, and the different variance estimators of the OLS estimator using each of the approaches (nonparametric and parametric). Then I evaluate, using individual tests, how many times we reject the true null hypotheses $\hat\beta_1^{OLS} = \beta_1$ and $\hat\beta_2^{OLS} = \beta_2$ at the 5% significance level for each of the estimated variances (nonparametric and parametric).

5 Results

5.1 Simulation Exercise

The results of the simulation are summarized in Table N8. We can observe that the type I error is around 70% for both estimators if we do not control for dependence in a setting in which it exists. Moreover, Table N8 shows that the type I error using the cluster variance is far from the nominal size, even when multiway clusters, which control for all forms of dependence, are used. This is because these estimators do not behave well in small samples of N and T. For example, the type I error of the state clusters, which control for time dependence, is around 22.5% and 24.4% for each beta, respectively. These higher probabilities of rejecting the true null hypothesis can be explained by the following two reasons: (i) this variance estimator does not take into account the current and the lagged spatial dependence, and (ii) the number of state clusters is not large enough.
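For reference, the nonparametric variance estimators compared in this exercise can be sketched as follows. The one-way estimator is the usual cluster sandwich without finite-sample correction, and the two-way version adds the state and time cluster variances and subtracts the variance clustered on their intersection; this is the common textbook construction and is an assumption about the exact variant used here. The simulated panel, the coefficients and the function names are hypothetical.

```python
import numpy as np

def ols(X, y):
    """OLS coefficients and residuals."""
    beta = np.linalg.solve(X.T @ X, X.T @ y)
    return beta, y - X @ beta

def cluster_cov(X, resid, groups):
    """One-way cluster-robust sandwich (no finite-sample correction)."""
    bread = np.linalg.inv(X.T @ X)
    k = X.shape[1]
    meat = np.zeros((k, k))
    for g in np.unique(groups):
        mask = groups == g
        s = X[mask].T @ resid[mask]        # score summed within the cluster
        meat += np.outer(s, s)
    return bread @ meat @ bread

def two_way_cov(X, resid, g1, g2):
    """Two-way clustering: V(g1) + V(g2) - V(g1 and g2 intersection)."""
    both = g1 * (g2.max() + 1) + g2        # unique code per (g1, g2) cell
    return (cluster_cov(X, resid, g1) + cluster_cov(X, resid, g2)
            - cluster_cov(X, resid, both))

# Hypothetical state-year panel with state and year components in the error.
rng = np.random.default_rng(0)
N, T = 50, 31
state = np.repeat(np.arange(N), T)
year = np.tile(np.arange(T), N)
x = rng.normal(size=N * T)
X = np.column_stack([x, x**2])
u = rng.normal(size=N)[state] + rng.normal(size=T)[year] + rng.normal(size=N * T)
y = X @ np.array([0.5, -0.2]) + u

beta, resid = ols(X, y)
V_state = cluster_cov(X, resid, state)
V_year = cluster_cov(X, resid, year)
V_two = two_way_cov(X, resid, state, year)
print(np.sqrt(np.diag(V_state)), np.sqrt(np.diag(V_year)), np.sqrt(np.diag(V_two)))
```

With one observation per state-year cell, the intersection term reduces to the heteroskedasticity-robust (White) variance. In small samples the two-way matrix is not guaranteed to be positive semidefinite, which is one practical symptom of the small-sample problems discussed above.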

Table N8: Probability of rejecting the null hypothesis when it is true

In the case of the time clusters, which control for spatial dependence, the type I errors for the betas are significantly higher (more than double) than with the state clusters. This could occur because this variance estimator takes into account neither the dependence across time nor the dependence across states at different moments in time. We must also notice that the number of time clusters is considerably smaller than the number of state clusters (31 vs 50), so this variance estimator behaves poorly. However, it is remarkable that the two-way and the multiway clustering approaches, which control for all kinds of dependence, still have a high probability of rejecting the true null hypothesis (around 20%), even though they behave better than the one-way cluster variances. These results reinforce the importance of having large N and T when we want to use the clustering approach to control for general forms of dependence. Therefore, the nonparametric approach does not behave well in small samples, as already discussed in the methodology. Finally, the parametric variance estimator is the only one which almost attains the nominal size of the test for both betas.

The main conclusions that emerge from the simulation exercise are the following: (i) Ignoring error dependence in a setting where it exists leads to higher rates of rejection of the true null hypothesis (accepting spurious regressors when there is no causal relationship). (ii) Using the nonparametric clustering approach to deal with error dependence in a setting where the number of clusters is not
so large leads to a bad approximation of the variance-covariance matrix, as reflected in a type I error significantly higher than the nominal size of the test (because these estimators need N and T to go to infinity). (iii) In small samples, the parametric approach is a better way to control for the dependence, provided the researcher is confident about the parametric model.

5.2 Robust Standard Errors for AMS model

In this subsection, I re-estimate the AMS model in order to compute standard errors using the methodologies previously discussed in this article (one-way cluster, two-way cluster, multiway cluster and parametric approach). Then we can analyze how the results change once robust standard errors are used in a context where there is error dependence in more than one dimension. In the following subsections I discuss separately the variance estimates for the marginal effects arising from: (i) the OLS regression of US state inequality on the state effective minimum and its square, plus fixed effects and time effects as controls; (ii) the 2SLS estimation after instrumenting both the state effective minimum wage and its square. The instruments are, respectively, the statutory minimum and the square of the predicted value from a regression of the state effective minimum wage on the statutory minimum plus fixed and time effects.

Table N9: Different variance estimators of the OLS estimator of the parameters of AMS model

As can be seen in Table N9, the standard errors of the OLS estimates of the minimum wage and its square increase by approximately 200% when we switch from non-clustered to state-clustered or time-clustered standard errors. When we cluster in only one dimension, either by state or by time, the OLS estimator of the minimum wage variable remains significant at the 1% level. However, the individual significance of the OLS estimator of the square of the minimum wage is reduced to the 5% level relative to the non-clustered case. When we use variance estimators that take into account dependence in more than one dimension, the OLS estimator of the square of the effective minimum wage loses significance at all levels, while the OLS estimator of the minimum wage remains significant, but now at the 5% level for the multiway cluster and at the 10% level using the parametric variance. Given that we are interested in the total marginal effect of the effective minimum rather than in each of its components separately ($\hat\beta_1$, $\hat\beta_2$), and because the standard errors could be inflated by multicollinearity between the effective minimum wage and its square, I present in Table N10 the estimated average marginal effect of the minimum wage on US state inequality, which is given by
$$\hat\beta_1 + 2\hat\beta_2 \frac{1}{NT}\sum_{i=1}^{N}\sum_{t=1}^{T}\left[\log wage^{Min}_{it} - \log wage_{it}(50)\right]$$

(Autor et al. (2010) present only the results for the estimated marginal effect.)

Table N10: Different variance estimators of the OLS estimator of the marginal effect of AMS model

The standard error of the estimated marginal effect increases by 200% when we use one-way clustering (either by time or by state) and by 300% when we use a multiway clustering approach, relative to the non-clustered case. Despite this, the estimated marginal effect remains significant at the 5% level. Finally, when we use the parametric variance to control for the dependence, the marginal effect of the minimum wage on US state inequality loses significance.

As AMS stress, the OLS estimator of the model could be affected by a division bias problem, because the variable $\log wage_{it}(50)$ appears on both sides of the regression. Therefore, the 2SLS estimation is more appropriate for studying the effect of the minimum wage on US state inequality. Table N11 presents the results for the 2SLS estimation.
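Given any of the covariance estimators for $(\hat\beta_1, \hat\beta_2)$, a standard error for the average marginal effect can be obtained with the delta method, treating the sample mean of the effective minimum wage as fixed. The text does not state how the standard errors in Table N10 were computed, so this is an assumption, and the numbers below are purely illustrative rather than estimates from the AMS data.

```python
import numpy as np

def average_marginal_effect(beta, cov, x):
    """AME = beta1 + 2*beta2*mean(x), with delta-method standard error.
    mean(x) is treated as a fixed constant (an assumption of this sketch)."""
    xbar = float(np.mean(x))
    g = np.array([1.0, 2.0 * xbar])      # gradient of the AME w.r.t. (beta1, beta2)
    ame = float(g @ beta)
    se = float(np.sqrt(g @ cov @ g))
    return ame, se

# Illustrative inputs only (hypothetical coefficients, covariance and regressor).
beta = np.array([0.3, 0.5])
cov = np.array([[0.04, 0.01],
                [0.01, 0.09]])
x = np.array([-0.7, -0.5, -0.3])         # effective minimum wage, mean -0.5

ame, se = average_marginal_effect(beta, cov, x)
print(ame, se)
```

Because the gradient combines both coefficients, a strong negative covariance between $\hat\beta_1$ and $\hat\beta_2$ (as multicollinearity between a regressor and its square tends to produce) can make the marginal effect more precisely estimated than either coefficient alone.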