Robust Inference for Regression with Clustered Data

Robust nference for Regression with Clustered Data Colin Cameron Univ. of California - Davis Joint work with Douglas L. Miller (UC - Davis). Based on A Practitioners Guide to Cluster-Robust nference, Journal of Human Resources, 2015. August 2015 Colin Cameron Univ. of California - Davis (JointRobust work with nference Douglas with L. Clustered Miller (UCData - Davis). Based on A Practitioners August 2015 Guide to1 Cluster-R / 58

1. ntroduction ntroduction 1. ntroduction Consider straightforward OLS estimation in linear regression model. Suppose estimator b β is consistent for β. Concerned with getting the correct standard errors of b β default: if errors are i.i.d. (0, σ 2 ) heteroskedastic-robust: if errors are independent (0, σ 2 i ) heteroskedastic and autocorrelation-robust (HAC): if errors are serially correlated cluster-robust: if errors are correlated within cluster and independent across clusters F this paper. Colin Cameron Univ. of California - Davis (JointRobust work with nference Douglas with L. Clustered Miller (UCData - Davis). Based on A Practitioners August 2015 Guide to2 Cluster-R / 58

1. ntroduction ntroduction Why is this important? 1. Cluster-robust standard errors can be much bigger than default or heteroskedastic-robust. 2. So failure to control for clustering overstates t statistics and understates p-values provides too narrow con dence intervals 3. This arises often especially in the empirical / public labor literature using quasi-experimental methods. 4. There are subtleties - not always straightforward to implement. Colin Cameron Univ. of California - Davis (JointRobust work with nference Douglas with L. Clustered Miller (UCData - Davis). Based on A Practitioners August 2015 Guide to3 Cluster-R / 58

1. ntroduction Example 1: ndividuals in Cluster Example 1: ndividuals in Cluster Example: How do job injury rates e ect wages? Hersch (1998). CPS individual data on male wages. But there is no individual data on job injury rate. nstead aggregated data on occupation injury rates 211 OLS estimate model for individual i in occupation g Problem: y ig = α + x 0 ig β + γ z g + u ig. the regressor z g (job injury risk in occupation g) is perfectly correlated within cluster (occupation) F by construction and the error u ig is (mildly) correlated within cluster F if model overpredicts for one person in occupation j it is likely to overpredict for others in occupation j. Colin Cameron Univ. of California - Davis (JointRobust work with nference Douglas with L. Clustered Miller (UCData - Davis). Based on A Practitioners August 2015 Guide to4 Cluster-R / 58

1. ntroduction Example 1: ndividuals in Cluster Simpler model, nine occupations, N = 1498. Summary statistics Colin Cameron Univ. of California - Davis (JointRobust work with nference Douglas with L. Clustered Miller (UCData - Davis). Based on A Practitioners August 2015 Guide to5 Cluster-R / 58

1. ntroduction Example 1: ndividuals in Cluster Same OLS regression with di erent se s estimated using Stata (1) iid errors, (2) het errors, (3,4) clustered errors global covars potexp potexpsq educ union nonwhite northe midw west regress lnw occrate $covars estimates store one_iid regress lnw occrate $covars, vce(robust) estimates store one_het regress lnw occrate $covars, vce(cluster occ_id) estimates store one_clu xtset occ_id xtreg lnw occrate $covars, pa corr(ind) vce(robust) estimates store one_xtclu estimates table one_iid one_het one_clu one_xtclu, /// b(%10.4f) se(%10.4f) p(%10.3f) stats(n N_clust rank F) Colin Cameron Univ. of California - Davis (JointRobust work with nference Douglas with L. Clustered Miller (UCData - Davis). Based on A Practitioners August 2015 Guide to6 Cluster-R / 58

1. ntroduction Example 1: ndividuals in Cluster Same OLS coe cients but cluster-robust standard errors (columns 3 and 4) when cluster on occupation are 2-4 times larger than default (column 1) or heteroskedastic-robust (column 2) and p-values in the last two columns di er substantially: t(8) versus N(0, 1) Colin Cameron Univ. of California - Davis (JointRobust work with nference Douglas with L. Clustered Miller (UCData - Davis). Based on A Practitioners August 2015 Guide to7 Cluster-R / 58

1. ntroduction Example 1: ndividuals in Cluster And cluster-robust variance matrix is rank de cient Colin Cameron Univ. of California - Davis (JointRobust work with nference Douglas with L. Clustered Miller (UCData - Davis). Based on A Practitioners August 2015 Guide to8 Cluster-R / 58

1. ntroduction Example 1: ndividuals in Cluster Moulton (1986, 1990) is key paper to highlight the larger standard errors when cluster due to regressors correlated within cluster and errors correlated within cluster. The di erent p-values in columns 3 and 4 arise when there are few clusters use t(8) not N(0, 1) The rank de ciency of the overall F-test is explained below individual t-statistics are still okay. Colin Cameron Univ. of California - Davis (JointRobust work with nference Douglas with L. Clustered Miller (UCData - Davis). Based on A Practitioners August 2015 Guide to9 Cluster-R / 58

1. ntroduction Example 2: Di erence-in-di erences in State-Year Panel Example 2: Di erence-in-di erences State-Year Panel Example: How do wages respond to a policy indicator variable d ts that varies by state? e.g. d ts = 1 if minimum wage law in e ect OLS estimate model for state s at time t Problem: y ts = α + x 0 ts β + γ d ts + u ts. the regressor d ts is highly correlated within cluster F typically d ts is initially 0 and at some stage switches to 1 the error u ts is (mildly) correlated within cluster F if model underpredicts for California in one year then it is likely to underpredict for other years. Colin Cameron Univ. of California - Davis (JointRobust work with nference Douglas with L. Clustered Miller (UCData - Davis). Based on A Practitioners August 2015 Guide to10 Cluster-R / 58

1. ntroduction Example 2: Di erence-in-di erences in State-Year Panel Again nd that default OLS standard errors are way too small should instead do cluster-robust (cluster on state) The same problem arises if we have data in individuals (i) in states and years y its = α + x 0 its β + γ d ts + u its in that case should also cluster on state. Bertrand, Du o & Mullainathan (2004) key paper that highlighted problems for DiD in 2004 people either ignored the problem or with its data erroneously clustered on state-year pair and not state. Colin Cameron Univ. of California - Davis (JointRobust work with nference Douglas with L. Clustered Miller (UCData - Davis). Based on A Practitioners August 2015 Guide to11 Cluster-R / 58

1. ntroduction Example 2: Di erence-in-di erences in State-Year Panel Outline 1 ntroduction 2 Cluster-Robust nference for OLS 3 Cluster-Speci c Fixed E ects 4 What to Cluster Over? 5 Multi-way Clustering 6 Few Clusters 7 Extensions (beyond OLS) 8 Empirical Example 9 Conclusion Colin Cameron Univ. of California - Davis (JointRobust work with nference Douglas with L. Clustered Miller (UCData - Davis). Based on A Practitioners August 2015 Guide to12 Cluster-R / 58

2. Cluster-Robust nference ntuition ntuition Suppose we have univariate data y i (µ, σ 2 ). We estimate µ by ȳ and " N 1 Var[ȳ] = Var N y i = 1 N N i=1 2 i=1 Given independence over i this simpli es to Var[ȳ] = 1 N σ2. N j=1 Cov(y i, y j ) Now suppose observations are equicorrelated with Cov(y i, y j ) = ρσ 2 for i 6= j " # Var[ȳ] = 1 N N 2 Var(y i ) + N N Cov(y i, y j ) i=1 i=1 j=1;j6=i = 1 N 2 [Nσ 2 + N(N 1)ρσ 2 ] = 1 N σ2 f1 + (n 1)ρg. #. olin Cameron Univ. of California - Davis (JointRobust work with nference Douglas with L. Clustered Miller (UCData - Davis). Based on A Practitioners August 2015 Guide to13 Cluster-R / 58

2. Cluster-Robust nference ntuition So independent errors Equicorrelated errors Var[ȳ] = 1 N σ2. Var[ȳ] = 1 N σ2 f1 + (N 1)ρg. the variance is 1 + (N 1)ρ times larger. Reason: An extra observation is not providing a new independent piece of information. Note that the e ect can be large if ρ = 0.1 (so R 2 of y i on y j is 0.01) and N = 81 then Var[ȳ] = 9 N 1 σ2 is 9 times larger! olin Cameron Univ. of California - Davis (JointRobust work with nference Douglas with L. Clustered Miller (UCData - Davis). Based on A Practitioners August 2015 Guide to14 Cluster-R / 58

2. Cluster-Robust nference 2.2 Clustered Errors 2.2 OLS with Clustered Errors Model for G clusters with N g individuals per cluster: OLS estimator y ig = x 0 ig β + u ig, i = 1,..., N g, g = 1,..., G, y g = X g β + u g, g = 1,..., G, y = Xβ + u. bβ = ( G g =1 N g i=1 x ig x 0 ig ) 1 ( G g =1 N g i=1 x ig y ig ) = ( G g =1 X0 g X g ) 1 ( G g =1 X0 g y g ) = (X 0 X) 1 X 0 y. Colin Cameron Univ. of California - Davis (JointRobust work with nference Douglas with L. Clustered Miller (UCData - Davis). Based on A Practitioners August 2015 Guide to15 Cluster-R / 58

2. Cluster-Robust nference 2.2 Clustered Errors As usual bβ = β + (X 0 X) 1 X 0 u = β + (X 0 X) 1 ( G g =1 X g u g ). Assume independence over g and correlation within g E[u ig u jg 0jx ig, x jg 0] = 0, unless g = g 0. Then β b a N [β, V[ β]] b with asymptotic variance Avar[ β] b = (E[X 0 X]) 1 ( G g =1 E[Xg 0 u g ug 0 X g ])(E[X 0 X]) 1 6= σ 2 u(e[x 0 X]) 1. Colin Cameron Univ. of California - Davis (JointRobust work with nference Douglas with L. Clustered Miller (UCData - Davis). Based on A Practitioners August 2015 Guide to16 Cluster-R / 58

2. Cluster-Robust nference 2.2 Clustered Errors Consequences - KEY RESULT FOR NSGHT Suppose equicorrelation within cluster g 1 Cor[u ig, u jg jx ig, x jg ] = ρ u i = j i 6= j this arises in a random e ects model with u ig = α g + ε ig, where α g and ε ig are i.i.d. errors. an example is individual i in village g or student i in school g. The incorrect default OLS variance estimate should be in ated by τ j ' 1 + ρ xj ρ u ( N g 1), (1) ρ xj is the within cluster correlation of x j (2) ρ u is the within cluster error correlation (3) N g is the average cluster size. Need both (1) and (2) and it also increases with (3) Colin Cameron Univ. of California - Davis (JointRobust work with nference Douglas with L. Clustered Miller (UCData - Davis). Based on A Practitioners August 2015 Guide to17 Cluster-R / 58

2. Cluster-Robust nference 2.2 Clustered Errors Theory: Kloek (1981), Scott and Holt (1982). Practice: Moulton (1986, 1990) showed that the variance in ation can be large even if ρ u is small especially with a grouped regressor (same for all individuals in group) so that ρ x = 1. CPS data example: N g = 81, ρ x = 1 and ρ u = 0.1 =) τ j ' 1 + ρ xj ρ u ( N g 1) = 1 + 1 0.1 80 = 9. F true standard errors are three times the default! So should correct for clustering even in settings where not obviously a problem. Colin Cameron Univ. of California - Davis (JointRobust work with nference Douglas with L. Clustered Miller (UCData - Davis). Based on A Practitioners August 2015 Guide to18 Cluster-R / 58

2. Cluster-Robust nference 2.3 The Cluster-Robust Variance Matrix Estimate 2.3 The Cluster-Robust Variance Matrix Estimate Recall for OLS with independent heteroskedastic errors Avar[ β] b = (E[X 0 X]) 1 ( N i=1 E[ui 2 x i xi 0 ])(E[X 0 X]) 1 can be consistently estimated (White (1980)) as N! by bv[ β] b = (X 0 X) 1 ( N i=1 bu i 2 x i xi 0 )(X 0 X) 1. Need 1 N N i=1 bu i 2x i xi 0 not bu i 2 p! E[ui 2] 1 N N i=1e[u 2 i x i x 0 i ] p! 0 olin Cameron Univ. of California - Davis (JointRobust work with nference Douglas with L. Clustered Miller (UCData - Davis). Based on A Practitioners August 2015 Guide to19 Cluster-R / 58

2. Cluster-Robust nference 2.3 The Cluster-Robust Variance Matrix Estimate Similarly for OLS with independent clustered errors Avar[ b β] = (E[X 0 X]) 1 ( G g =1 E[X 0 g u g u 0 g X g ])(E[X 0 X]) 1 can be consistently estimated as G! by the cluster-robust variance estimate (CRVE) bv CR [ b β] = (X 0 X) 1 ( G g =1 X 0 g eu g eu 0 g X g )(X 0 X) 1. Stata uses eu g = cbu g = c(y g X g b β) where c = G G 1 N 1 N K ' G G 1. Colin Cameron Univ. of California - Davis (JointRobust work with nference Douglas with L. Clustered Miller (UCData - Davis). Based on A Practitioners August 2015 Guide to20 Cluster-R / 58

2. Cluster-Robust nference 2.3 The Cluster-Robust Variance Matrix Estimate The CRVE was proposed by White (1984) for balanced case proposed by Liang and Zeger (1986) for grouped data proposed by Arellano (1987) for FE estimator for short panels (group on individual) Hansen (2007a) and Carter, Schnepel and Steigerwald (2013) also allow N g!. popularized by incorporation in Stata as the cluster option (Rogers (1993)). also allows for heteroskedasticity so is cluster- and heteroskedasticrobust. Stata with cluster identi er id_clu regress y x, vce(cluster id_clu) xtreg y x, pa corr(ind) vce(robust) F F after xtset id_clu from version 12.1 on Stata interprets vce(robust) as cluster-robust for all xt commands. Colin Cameron Univ. of California - Davis (JointRobust work with nference Douglas with L. Clustered Miller (UCData - Davis). Based on A Practitioners August 2015 Guide to21 Cluster-R / 58

2. Cluster-Robust nference 2.4 Feasible GLS 2.4. Feasible GLS with Cluster-Robust nference Potential e ciency gains for feasible GLS compared to OLS. Specify a model for Ω g = E[u g u 0 g jx g ], e.g. within-cluster equicorrelation. Given bω p! Ω, the feasible GLS estimator of β is G bβ FGLS = g =1 X0 b g Ωg 1 X g 1 G g =1 X0 g b Ω 1 g y g. Default bv[ b β FGLS ] = (X 0 bω 1 X) 1 requires correct Ω. To guard against misspeci ed Ω g use cluster-robust bv CR [ b β FGLS ] = 1 X 0 bω 1 G X g =1 X0 b g Ωg 1 bu g bu g 0 Ω b g 1 X g X 0 bω 1 X where bu g = y g X g βfgls b and bω = Diag[ bω g ] assumes u g and u h are uncorrelated, for g 6= h and needs G!. Colin Cameron Univ. of California - Davis (JointRobust work with nference Douglas with L. Clustered Miller (UCData - Davis). Based on A Practitioners August 2015 Guide to22 Cluster-R / 58

2. Cluster-Robust nference 2.4 Feasible GLS FGLS Examples Example 1 - Moulton setting Random e ects model: y ig = x 0 ig β + α g + ε ig F xtreg, re vce(robust) Richer hierarchical linear model or mixed model F Stata 13: mixed, vce(robust) Example 2 - BDM setting AR(1) error u it = ρu i,t 1 + ε it and ε it i.i.d. xtreg y x, pa corr(ar 1) vce(robust) Stata allows a range of correlation structures Puzzle - why is FGLS not used more? Easily done in Stata with robust VCE if G! Unless FE s present and N g small (see later). Colin Cameron Univ. of California - Davis (JointRobust work with nference Douglas with L. Clustered Miller (UCData - Davis). Based on A Practitioners August 2015 Guide to23 Cluster-R / 58

2. Cluster-Robust nference 2.5 mplementation of OLS and FGLS 2.5 The CRVE can be rank de cient bv CR [ b β] can be rank de cient rank is as most minimum of K and G 1 bb = C 0 C, where C 0 = [X1 0 bu 1 XG 0 bu G ] is K G and X1 0 bu 1 + + XG 0 bu G = 0 For example if have 15 clusters (say states) Cannot jointly test signi cance of 20 occupation dummies But can test joint signi cance of 14. The test of overall joint statistical signi cance is not computable if G < K but tests on individual coe cients are still okay. Colin Cameron Univ. of California - Davis (JointRobust work with nference Douglas with L. Clustered Miller (UCData - Davis). Based on A Practitioners August 2015 Guide to24 Cluster-R / 58

2. Cluster-Robust nference 2.6 Pairs Cluster Bootstrap 2.6 Pairs Cluster Bootstrap Do the following steps for each of B bootstrap samples: (1) form G clusters f(y1, X 1 ),..., (y G, X G )g by resampling with replacement G times from the original sample of clusters (2) compute β b b (estimate of β) in the b th bootstrap sample. Compute the variance of the B estimates b β 1,..., b β B as bv CR;boot [ b β] = 1 B 1 B b=1 (b β b b β)( b βb b β) 0, where b β = B 1 B b=1 b β b and B 400. Pairs cluster bootstrap has no asymptotic re nement. But can compute these if Stata doesn t provide a CRVE. Also can do even if usual CRVE is rank de cient? Also cluster jackknife. Colin Cameron Univ. of California - Davis (JointRobust work with nference Douglas with L. Clustered Miller (UCData - Davis). Based on A Practitioners August 2015 Guide to25 Cluster-R / 58

3. Cluster-Speci c Fixed E ects models 3. Cluster-Speci c Fixed E ects Models: Summary Now y ig = x 0 ig β + α g + u ig = x 0 ig β + G h=1 α g dh ig + u ig. 1. FE s do not in practice absorb all within cluster correlation of u ig still need to uce cluster-robust VCE 2. Cluster-robust VCE is still okay with FE s (if G! ) Arellano (1987) for N g small and Hansen (2007a, p.600) for N g! 3. f N g small use xtreg, fe not reg i.id_clu as reg or areg uses wrong degrees of freedom 4. FGLS with xed e ects needs to bias-adjust for bα g inconsistent Hansen (2007b) provides bias-corrected FGLS for AR(p) errors Brewer, Crossley and Joyce (2013) implement in DiD setting Hausman and Kuersteiner (2008) provide bias-corrected FGLS for Kiefer (1980) error model 5. Need to do a modi ed Hausman test for xed e ects. Colin Cameron Univ. of California - Davis (JointRobust work with nference Douglas with L. Clustered Miller (UCData - Davis). Based on A Practitioners August 2015 Guide to26 Cluster-R / 58

4. What to Cluster Over? 4.1 Factors Determining What to Cluster Over 4.1 Factors Determining What to Cluster Over t is not always obvious how to specify the clusters. Moulton (1986, 1990) cluster at the level of an aggregated regressor. Bertrand, Du o and Mullainathan (2004) with state-year data cluster on states (assumed to be independent) rather than state-year pairs. Pepper (2002) cluster at the highest level where there may be correlation e.g. for individual in household in state may want to cluster at level of the state if state policy variable is a regressor. Colin Cameron Univ. of California - Davis (JointRobust work with nference Douglas with L. Clustered Miller (UCData - Davis). Based on A Practitioners August 2015 Guide to27 Cluster-R / 58

4. What to Cluster Over? 4.2 Clustering Due to Survey Design 4.2 Clustering Due to Survey Design Clustering routinely arises with complex survey data. Then the loss of e ciency due to clustering is called the design e ect This is the inverse of the variance in ation factor given earlier Long literature going back to 1960 s CRVE is called the linearization formula Shah, Holt and Folsom (1977) is early reference. Complex survey data are weighted often ignore assuming conditioning on x handles weighting And strati ed this improves estimator e ciency somewhat Bhattacharya (2005) gives a general GMM treatment. Colin Cameron Univ. of California - Davis (JointRobust work with nference Douglas with L. Clustered Miller (UCData - Davis). Based on A Practitioners August 2015 Guide to28 Cluster-R / 58

4. What to Cluster Over? 4.2 Clustering Due to Survey Design Econometricians reasonably Cluster on PSU or higher Sometimes weight and sometimes not gnore strati cation (with slight loss in e ciency) Survey software controls for all three. Stata svy commands Econometricians use regular commands with vce(cluster) and possibly [pweight=1/prob] Colin Cameron Univ. of California - Davis (JointRobust work with nference Douglas with L. Clustered Miller (UCData - Davis). Based on A Practitioners August 2015 Guide to29 Cluster-R / 58

5. Multi-Way Clustering 5. Multi-way Clustering Example: How do job injury rates e ect wages? Hersch (1998). CPS individual data on male wages N = 5960. But there is no individual data on job injury rate. nstead aggregated data: F F data on industry injury rates for 211 industries data on occupation injury rates for 387 occupations. Model estimated is What should we do? y igh = α + x 0 igh β + γ rind ig + δ rocc ih + u igh. Ad hoc robust: OLS and robust cluster on industry for bγ and robust cluster on occupation for bδ. Non-robust: FGLS two-way random e ects: u igh = ε g + ε h + ε igh ; ε g, ε h, ε igh i.i.d. Two-way robust: next Colin Cameron Univ. of California - Davis (JointRobust work with nference Douglas with L. Clustered Miller (UCData - Davis). Based on A Practitioners August 2015 Guide to30 Cluster-R / 58

5. Multi-Way Clustering 5.1 Two-way Cluster-Robust VE Two-way Clustering Robust variance matrix estimates are of the form bavar[ β] b = (X 0 X) 1 bb(x 0 X) 1 For one-way clustering with clusters g = 1,..., G we can write bb = N i=1 N j=1 x i xj 0 bu i bu j 1[i, j in same cluster g] where bu i = y i xi 0 β b and the indicator function 1[A] equals 1 if event A occurs and 0 otherwise. For two-way clustering with clusters g = 1,..., G and h = 1,..., H bb = N i=1 N j=1 x i xj 0 bu i bu j 1[i, j share any of the two clusters] = N i=1 N j=1 x i xj 0 bu i bu j 1[i, j in same cluster g] + N i=1 N j=1 x i xj 0 bu i bu j 1[i, j in same cluster h] N i=1 N j=1 x i xj 0 bu i bu j 1[i, j in both cluster g and h]. Colin Cameron Univ. of California - Davis (JointRobust work with nference Douglas with L. Clustered Miller (UCData - Davis). Based on A Practitioners August 2015 Guide to31 Cluster-R / 58

5. Multi-Way Clustering 5.1 Two-way Cluster-Robust VE Obtain three di erent cluster-robust variance matrices for the estimator by one-way clustering in, respectively, the rst dimension, the second dimension, and by the intersection of the rst and second dimensions add the rst two variance matrices and, to account for double-counting, subtract the third. Thus bv two-way [ β] b = bv G [ β] b + bv H [ β] b bv G \H [ β], b Theory presented in Cameron, Gelbach, and Miller (2006, 2011), Miglioretti and Heagerty (2006), and Thompson (2006, 2011) Extends to multi-way clustering. Early empirical applications that independently proposed this method include Acemoglu and Pischke (2003). Colin Cameron Univ. of California - Davis (JointRobust work with nference Douglas with L. Clustered Miller (UCData - Davis). Based on A Practitioners August 2015 Guide to32 Cluster-R / 58

5. Multi-Way Clustering 5.2 mplementation 5.2 mplementation f bv[ b β] is not positive-de nite (small G, H) then Decompose bv[ b β] = UΛU 0 ; U contains eigenvectors of bv, and Λ = Diag[λ 1,..., λ d ] contains eigenvalues. Create Λ + = Diag[λ + 1,..., λ+ d ], with λ+ j = max 0, λ j, and use bv + [ β] b = UΛ + U 0 Stata add-on cgmreg.ado implements this. Also Stata add-on xtivreg2.ado has two-way clustering for a variety of linear model estimators. Fixed e ects in one or both dimensions Theory has not formally addressed this complication ntuitively if G! and H! then each xed e ect is estimated using many observations. n practice the main consequence of including xed e ects is a reduction in within-cluster correlation of errors. Colin Cameron Univ. of California - Davis (JointRobust work with nference Douglas with L. Clustered Miller (UCData - Davis). Based on A Practitioners August 2015 Guide to33 Cluster-R / 58

5. Multi-Way Clustering 5.2 mplementation Application Example 1: Hersch data Relatively small di erence versus one-way But can simultaneously handle both ways rather than one-way cluster on industry for bγ and one-way cluster on occupation for bδ. Example 2: DiD We have found little di erence if cluster two-way on state and time versus just one-way on state. Studies in nance view this as important. Example 3: Country-pair international trade volume Two-way cluster on country 1 and country 2 leads to much bigger standard errors (Cameron et al. 2011) Cameron and Miller (2012) nd that two-way still doesn t pick up all correlations. nstead other methods including Fafchamps and Gubert (2007). Colin Cameron Univ. of California - Davis (JointRobust work with nference Douglas with L. Clustered Miller (UCData - Davis). Based on A Practitioners August 2015 Guide to34 Cluster-R / 58

5. Multi-Way Clustering 5.3 Feasible GLS 5.3 Feasible GLS Two-way random e ects y igh = xigh 0 β + α g + δ h + ε ig with i.i.d. errors xtmixed y x jj _all: R.id1 jj id2:, mle. but cannot then get cluster-robust variance matrix Hierarchical linear models or mixed models richer FGLS y ig = xig 0 β g + u ig β g = W g γ + v j where u ig and v g are errors. see Rabe-Hesketh and Skrondal (2012) Colin Cameron Univ. of California - Davis (JointRobust work with nference Douglas with L. Clustered Miller (UCData - Davis). Based on A Practitioners August 2015 Guide to35 Cluster-R / 58

5. Multi-Way Clustering 5.4 Spatial Correlation Spatial Correlation Two-way cluster robust related to time-series and spatial HAC. n general bb in preceding has the form i j w (i, j) x i x 0 j bu i bu j. Two-way clustering: w (i, j) = 1 for observations that share a cluster. White and Domowitz (1984) time series: w (i, j) = 1 for observations close in time to one another. Conley (1999) spatial: w (i, j) decays to 0 as the distance between observations grows. The di erence: White & Domowitz and Conley use mixing conditions to ensure decay of dependence in time or distance. Mixing conditions do not apply to clustering due to common shocks. nstead two-way robust requires independence across clusters. Colin Cameron Univ. of California - Davis (JointRobust work with nference Douglas with L. Clustered Miller (UCData - Davis). Based on A Practitioners August 2015 Guide to36 Cluster-R / 58

5. Multi-Way Clustering 5.4 Spatial Correlation Spatial Correlation Consistent VE Driscoll and Kraay (1998) panel data when T! generalizes HAC to spatial correlation errors potentially correlated across individuals correlation across individuals disappears for obs > m time periods apart then w (it, js) = 1 d (it, js) /(m + 1) with sum over i, j, s and t and d (it, js) = jt sj if jt sj m and d (it, js) = 0 otherwise. Stata add-on command xtscc, due to Hoechle (2007). Foote (2007) contrasts various variance matrix estimators in a macroeconomics example. Petersen (2009) contrasts methods for panel data on nancial rms. Barrios, Diamond, mbens, and Kolesár (2012) state-year panel on individuals with spatial correlation across states. And use randomization inference. Colin Cameron Univ. of California - Davis (JointRobust work with nference Douglas with L. Clustered Miller (UCData - Davis). Based on A Practitioners August 2015 Guide to37 Cluster-R / 58

6. Few Clusters 6. nference with Few Clusters One-way clustering, and focus on the Wald t-statistic w = b β β 0. s b β CRVE assumes G!. What if G is small? At a minimum use CRVE with rescaled error eu g = p cbu g where c = G G 1 or c = G G 1 N 1 N k ' G G 1 And use T (G 1) critical values Stata does this for regress but not other commands.. But tests still over-reject with small G. olin Cameron Univ. of California - Davis (JointRobust work with nference Douglas with L. Clustered Miller (UCData - Davis). Based on A Practitioners August 2015 Guide to38 Cluster-R / 58

6. Few Clusters 6.1 The Basic Problem with Few Clusters 6.1 The Basic Problem with Few Clusters OLS over ts with bu systematically biased to zero compared to u. e.g. OLS with iid normal errors E[bu 0 bu] = (N K )σ 2, not Nσ 2. Problem is greatest as G gets small - few clusters. How few is few? balanced data; G < 20 to G < 50 depending on data unbalanced data: G less than this. Unusual case. f N is too small with cross-section data, usually everything is statistically insigni cant. With clustered data if G is small we may still have statistical signi cance if N g is small. Colin Cameron Univ. of California - Davis (JointRobust work with nference Douglas with L. Clustered Miller (UCData - Davis). Based on A Practitioners August 2015 Guide to39 Cluster-R / 58

6. Few Clusters 6.2 Solution 1: Bias-Corrected CRVE 6.2 Solution 1: Bias-Corrected CRVE Simplest is eu g = p cbu g, already mentioned. CR2VE generalizes HC2 for heteroskedasticity eu g = [ Ng H gg ] 1/2 bu g where H gg = X g (X 0 X) 1 Xg 0 gives unbiased CRVE if errors iid normal CR3VE generalizes HC3 for heteroskedasticity eu g + = p G /(G 1)[ Ng H gg ] 1 bu g where H gg = X g (X 0 X) 1 Xg 0 same as jackknife Finite sample Wald tests at least use T (G 1) p-values and critical values and not N [0, 1] Example G = 10 F t = 1.96 has p = 0.082 using T (9) versus p = 0.05 using N [0, 1] ad hoc reasonable correction used by Stata. olin Cameron Univ. of California - Davis (JointRobust work with nference Douglas with L. Clustered Miller (UCData - Davis). Based on A Practitioners August 2015 Guide to40 Cluster-R / 58

6. Few Clusters 6.3 Solution 2: Cluster Bootstrap with Asymptotic Re nement 6.3 Solution 2: Cluster Bootstrap with Asymptotic Re nement Cameron, Gelbach and Miller (2008) Test H 0 : β 1 = β 0 1 against H a : β 1 6= β 0 1 using w = ( bβ 1 β 0 1 )/s bβ 1 perform a cluster bootstrap with asymptotic re nement then true test size is α + O(G 3/2 ) rather than usual α + O(G 1 ) hopefully improvement when G is small wild cluster percentile-t bootstrap is best better than pairs cluster percentile-t bootstrap. Colin Cameron Univ. of California - Davis (JointRobust work with nference Douglas with L. Clustered Miller (UCData - Davis). Based on A Practitioners August 2015 Guide to41 Cluster-R / 58

6. Few Clusters 6.3 Solution 2: Cluster Bootstrap with Asymptotic Re nement Wild Cluster Bootstrap 1 Obtain the OLS estimator b β and OLS residuals bu g, g = 1,..., G. Best to use residuals that impose H 0. 2 Do B iterations of this step. On the b th iteration: 1 For each cluster g = 1,..., G, form bu g = bu g or bu g = bu g each with probability 0.5 and hence form by g = X 0 g b β + bu g. This yields wild cluster bootstrap resample f(by 1, X 1),..., (by G, X G )g. 2 Calculate the OLS estimate bβ 1,b and its standard error s bβ and given 1,b these form the Wald test statistic wb = ( bβ 1,b bβ 1 )/s b β. 1,b 3 Reject H 0 at level α if and only if w < w [α/2] or w > w [1 α/2], where w [q] denotes the qth quantile of w 1,..., w B. Colin Cameron Univ. of California - Davis (JointRobust work with nference Douglas with L. Clustered Miller (UCData - Davis). Based on A Practitioners August 2015 Guide to42 Cluster-R / 58

6. Few Clusters 6.3 Solution 2: Cluster Bootstrap with Asymptotic Re nement Current Research Webb (2013) proposes using a six-point distribution for the weights d g in bu g = d g bu g. The weights d g have a 1/6 chance of each value in f p p p p p p 1.5, 1,.5,.5, 1, 1.5g. Works better with few clusters than two-point F Two-point cluster gives only 2 G 1 di erent bootstrap resamples. Also with few clusters need to enumerate rather than bootstrap. MacKinnon and Webb (2013) nd that unbalanced cluster sizes worsens few clusters problem. Wild cluster bootstrap does well. Colin Cameron Univ. of California - Davis (JointRobust work with nference Douglas with L. Clustered Miller (UCData - Davis). Based on A Practitioners August 2015 Guide to43 Cluster-R / 58

6. Few Clusters 6.3 Solution 2: Cluster Bootstrap with Asymptotic Re nement Use the Bootstrap with Caution We assume clustering does not lead to estimator inconsistency focus is just on the standard errors. We assume that the bootstrap is valid this is usually the case for smooth problems with asymptotically normal estimators and usual rates of convergence. but there are cases where the bootstrap is invalid. When bootstrapping always set the seed (for replicability) use more bootstraps than the Stata default of 50 F for bootstraps without asymptotic re nement 400 should be plenty. When bootstrapping a xed e ects panel data model the additional option idcluster() must be used F for explanation see Stata manual [R] bootstrap: Bootstrapping statistics from data with a complex structure. Colin Cameron Univ. of California - Davis (JointRobust work with nference Douglas with L. Clustered Miller (UCData - Davis). Based on A Practitioners August 2015 Guide to44 Cluster-R / 58

6. Few Clusters 6.4 Solution 3: mproved T Critical Values Solution 3: mproved T Critical Values Suppose all regressors are invariant within clusters, clusters are balanced and errors are i.i.d. normal then y ig = xg 0 β + ε ig =) ȳ g = xg 0 β + ε g with ε g i.i.d. normal so Wald test based on OLS is exactly T (G L), where L is the number of group invariant regressors. Extend to nonnormal errors and group varying regressors asymptotic theory when G is small and N g!. Donald and Lang (2007) propose a two-step FGLS RE estimator yields t-test that is T (G L) under some assumptions Wooldridge (2006) proposes an alternative minimum distance method. Colin Cameron Univ. of California - Davis (JointRobust work with nference Douglas with L. Clustered Miller (UCData - Davis). Based on A Practitioners August 2015 Guide to45 Cluster-R / 58

6. Few Clusters 6.4 Solution 3: mproved T Critical Values Current Research (continued) mbens and Kolesar (2012) Data-determined number of degrees of freedom for t and F tests Builds on Satterthwaite (1946) and Bell and McCa rey (2002). Assumes normal errors and particular model for Ω. Match rst two moments of test statistic with rst two moments of χ 2. v = ( G j=1 λ j ) 2 /( G j=1 λ2 j ) and λ j are the eigenvalues of the G G matrix G 0 b G. Find works better than 2-point Wild cluster bootstrap but they did not impose H 0. Colin Cameron Univ. of California - Davis (JointRobust work with nference Douglas with L. Clustered Miller (UCData - Davis). Based on A Practitioners August 2015 Guide to46 Cluster-R / 58

6. Few Clusters 6.4 Solution 3: mproved T Critical Values Carter, Schnepel and Steigerwald (2013) provide asymptotic theory when clusters are unbalanced propose a measure of the e ective number of clusters G = G /(1 + δ) F where δ = G 1 G g =1 f(γ g γ) 2 / γ 2 g F γ g = ek 0 (X0 X) 1 Xg 0 g X g (X 0 X) 1 e k F e k is a K 1 vector of zeroes aside from 1 in the k th position if bβ = bβ k F γ = G 1 G g =1 γ g. Cluster heterogeneity (δ 6= 0) can arise for many reasons variation in N g, variation in X g and variation in g across clusters. Colin Cameron Univ. of California - Davis (JointRobust work with nference Douglas with L. Clustered Miller (UCData - Davis). Based on A Practitioners August 2015 Guide to47 Cluster-R / 58

6. Few Clusters 6.4 Solution 3: mproved T Critical Values Brewer, Crossley and Joyce (2013) Do FGLS as gives both e ciency gains and works well even with few clusters. Colin Cameron Univ. of California - Davis (JointRobust work with nference Douglas with L. Clustered Miller (UCData - Davis). Based on A Practitioners August 2015 Guide to48 Cluster-R / 58

6. Few Clusters 6.5 Special Cases 6.5 Special Cases Bester, Conley and Hansen (2009) obtain T (G 1) in settings such as panel where mixing conditions apply. bragimov and Muller (2010) take an alternative approach suppose only within-group variation is relevant then separately estimate β b g s and average asymptotic theory when G is small and N g! A big limitation is assumption of only within variation for example in state-year panel application with clustering on state it rules out z t in y st = x 0 st β + z0 t γ + ε ig where z t are for example time dummies. This limitation is relevant in DiD models with few treated groups Conley and Taber (2010) present a novel method for that case. Colin Cameron Univ. of California - Davis (JointRobust work with nference Douglas with L. Clustered Miller (UCData - Davis). Based on A Practitioners August 2015 Guide to49 Cluster-R / 58

7. Extensions 7. Extensions The results for OLS and FGLS and t-tests extend to multiple hypothesis tests and V, 2SLS. GMM and nonlinear estimators. These extensions are incorporated in Stata but Stata generally does not use nite-cluster degrees-of-freedom adjustments in computing test p-values and con dence intervals F exception is command regress. Colin Cameron Univ. of California - Davis (JointRobust work with nference Douglas with L. Clustered Miller (UCData - Davis). Based on A Practitioners August 2015 Guide to50 Cluster-R / 58

7. Extensions Extensions 7.1 Cluster-Robust F-tests 7.2 nstrumental Variables Estimators V, 2SLS, linear GMM Need modi ed Hausman test for endogeneity : estat endogenous Weak instruments: F F F First-stage F-test should be cluster-robust use add-on xtivreg2 Finlay and Magnusson (2009) have Stata add-on rivtest.ado. 7.3 Nonlinear Estimators Population-averaged (xtreg, pa) and random e ects (e.g. xtlogit, re) give quite di erent βs Rarely can eliminate xed e ects if N g is small. 7.4 Cluster-randomized Experiments Colin Cameron Univ. of California - Davis (JointRobust work with nference Douglas with L. Clustered Miller (UCData - Davis). Based on A Practitioners August 2015 Guide to51 Cluster-R / 58

8. Empirical Example 8. Empirical Example: Moulton Setting Table 1: Moulton setting. Cross-section individual-level data March 2012 CPS data with state-level regressor and cluster on state. N = 65685 and G = 51. Compare various standard errors for OLS and FGLS (RE). Table 2: 20% subsample of data in Table 1. Now construct a fake dummy and test H 0 : β = 0. Do this for G = 50, 30, 20, 10 and 6 S = 4000 for G 10 and S = 1000 for G > 10. B = 399 (okay for Monte Carlo but set higher in practice). Colin Cameron Univ. of California - Davis (JointRobust work with nference Douglas with L. Clustered Miller (UCData - Davis). Based on A Practitioners August 2015 Guide to52 Cluster-R / 58

8. Empirical Example olin Cameron Univ. of California - Davis (JointRobust work with nference Douglas with L. Clustered Miller (UCData - Davis). Based on A Practitioners August 2015 Guide to53 Cluster-R / 58

8. Empirical Example Colin Cameron Univ. of California - Davis (JointRobust work with nference Douglas with L. Clustered Miller (UCData - Davis). Based on A Practitioners August 2015 Guide to54 Cluster-R / 58

8. Empirical Example 8. Empirical Example: BDM Setting Table 3: BDM setting. Panel level state-year data March 1977-2012 CPS data. Aggregated from individual level data using Hansen (2007) method F F OLS regress y its on regresors x its and state-year dummies D ts gives coe cients ey ts OLS regress ey ts = α s + δ t + β d ts + u ts G = 51, T = 36, N = G T = 1836, Compare various standard errors for FE-OLS and FE-FGLS (AR(1)). Table 4: Same data as Table 3. Now construct a fake serially correlated dummy and test H 0 : β = 0. Do this for G = 50, 30, 20, 10 and 6. Colin Cameron Univ. of California - Davis (JointRobust work with nference Douglas with L. Clustered Miller (UCData - Davis). Based on A Practitioners August 2015 Guide to55 Cluster-R / 58

9. Conclusion 9. Conclusion Where clustering is present it is important to control for it. We focus on obtaining cluster-robust standard errors though clustering may also lead to estimator inconsistency. Many Stata commands provide cluster-robust standard errors using option vce() a cluster bootstrap can be used when option vce() does not include clustering. n practice it can be di cult to know at what level to cluster the number of clusters may be few and asymptotic theory is in the number of clusters. Colin Cameron Univ. of California - Davis (JointRobust work with nference Douglas with L. Clustered Miller (UCData - Davis). Based on A Practitioners August 2015 Guide to58 Cluster-R / 58