Empirical Methods in Applied Microeconomics


Jörn-Steffen Pischke
LSE, October

1 Nonstandard Standard Error Issues

The discussion so far has concentrated on identification of the effect of interest. Obviously, this should always be the main concern: there is little consolation in having an accurate standard error on a meaningless estimate! Hopefully, the previous chapters will help you to design research projects and empirical strategies which lead to valid estimates. But there are a few important inference issues which arise with the type of cross-sectional and panel data we typically use in applied econometrics. It is therefore time to try to tackle those. This chapter uses somewhat more matrix algebra than the previous ones but will hopefully be equally accessible.

1.1 The Bias of Robust Standard Errors

The natural way to compute asymptotic standard errors and t-statistics for regression is to use the robust covariance matrix

$$\left(\sum [X_i X_i']/N\right)^{-1} \left(\sum [X_i X_i' \hat\varepsilon_i^2]/N\right) \left(\sum [X_i X_i']/N\right)^{-1}.$$

Of course, asymptotic covariance matrices, as the name suggests, are only valid in large samples. We have seen already that "large samples" is always a relative concept, and things may go awry in the samples we use in our actual research. The robust covariance matrix is no exception. Suppose the actual covariance matrix of the population regression residuals is given by $E[\varepsilon\varepsilon'|X] = \Omega = \mathrm{diag}(\sigma_i^2)$. For the moment the covariance matrix is diagonal, meaning that residuals are independent across observations. We will take up the case of dependent residuals in the following section. The covariance matrix of the OLS estimator is then

$$V = (X'X)^{-1} X'\Omega X (X'X)^{-1}. \qquad (1)$$

With fixed $X$s this is the actual covariance matrix applicable to our small sample estimator, not just the asymptotic covariance matrix. The problem is that it involves the unknown $\sigma_i^2$s, which we replace by the sample counterparts $\hat\varepsilon_i^2$ in our covariance estimator. Notice that

$$\hat\varepsilon = y - X\hat\beta = y - X(X'X)^{-1}X'y = \left[I - X(X'X)^{-1}X'\right](X\beta + \varepsilon) = M\varepsilon$$

where $M = I - X(X'X)^{-1}X'$ is the residual maker matrix and $\varepsilon$ is the residual of the population regression. Denoting the $i$-th column of the matrix $M$ by $m_i$, we have $\hat\varepsilon_i = m_i'\varepsilon$. It follows that

$$E\left[\hat\varepsilon_i^2\right] = E\left[m_i'\varepsilon\varepsilon' m_i\right] = m_i'\Omega m_i.$$

Notice that $m_i$ is the $i$-th column of the identity matrix (call it $e_i$) minus the $i$-th column of the projection matrix $H = X(X'X)^{-1}X'$ (which is also called the hat-matrix, since it makes predicted values). Denote the $i$-th column of the hat-matrix by $h_i$, so that $h_i' = x_i'(X'X)^{-1}X'$. Hence $m_i = e_i - h_i$, and therefore

$$E\left[\hat\varepsilon_i^2\right] = (e_i - h_i)'\Omega(e_i - h_i) = \sigma_i^2 - 2\sigma_i^2 h_{ii} + h_i'\Omega h_i \qquad (2)$$

where $h_{ii}$ is the $i$-th diagonal element of the hat-matrix. Because this matrix is symmetric and idempotent (meaning $HH' = H$), it follows that $h_{ii} = h_i'h_i$, so that we obtain (see Chesher and Jewitt, 1987)

$$E\left[\hat V\right] - V = (X'X)^{-1} X'\,\mathrm{diag}\left[h_i'(\Omega - 2\sigma_i^2 I)h_i\right] X\,(X'X)^{-1}. \qquad (3)$$

While $\hat V$ is biased, it is easy to see that it is a consistent estimator of $V$. Consider the case of fixed $X$s again, and focus on the middle bit of the matrix, $X'\hat\Omega X$. Notice that $\hat\Omega$ is not consistent for $\Omega$, since there are more and more elements to estimate as the sample gets large. Nevertheless, $\hat\varepsilon_i$ is consistent for $\varepsilon_i$, since $\hat\varepsilon_i = y_i - x_i'\hat\beta$ and $\hat\beta$ is consistent for $\beta$ (another way to think of this: if we have the entire population instead of a sample, we get the population residual from the population regression). But

$$X'\hat\Omega X/N = \frac{1}{N}\sum \hat\varepsilon_i^2\, x_i x_i'$$

and since the $\hat\varepsilon_i^2$ converge to the $\varepsilon_i^2$, which have mean $\sigma_i^2$, we get $\mathrm{plim}\; X'\hat\Omega X/N = \mathrm{plim}\; X'\Omega X/N$.
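The residual-maker algebra above is easy to check numerically. The following sketch (Python with numpy, purely for illustration; the simulated design is ours) builds the hat matrix $H$ and residual maker $M$ for a small regression and verifies that $\hat\varepsilon = My$ and that the leverages $h_{ii}$ sum to the number of regressors:

```python
import numpy as np

rng = np.random.default_rng(0)
N, k = 30, 3
X = np.column_stack([np.ones(N), rng.normal(size=(N, k - 1))])
y = X @ np.array([1.0, 2.0, -1.0]) + rng.normal(size=N)

H = X @ np.linalg.inv(X.T @ X) @ X.T   # hat matrix: makes predicted values
M = np.eye(N) - H                      # residual maker matrix
eps_hat = y - X @ np.linalg.lstsq(X, y, rcond=None)[0]  # OLS residuals
h = np.diag(H)                         # leverages h_ii

# the OLS residuals are M y, and trace(H) = rank(H) = k since H is idempotent
assert np.allclose(eps_hat, M @ y)
assert np.isclose(np.trace(H), k)
```

Both identities hold exactly (up to floating point error) in any sample, which is why the bias formulas below depend only on the leverages and on $\Omega$.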

So why is $\hat V$ biased? The reason is that $E[\hat\varepsilon_i^2]$ is a biased estimate of $\sigma_i^2$, as we have seen in (2). Consider the case where the residual is actually homoskedastic, so that $\sigma_i^2 = \sigma^2$. In this case (2) gives

$$E\left[\hat\varepsilon_i^2\right] = \sigma^2 - 2\sigma^2 h_{ii} + \sigma^2 h_{ii} = \sigma^2(1 - h_{ii}).$$

The variance of the residual in small samples is too small, and this is related to the quantity $h_{ii}$. So we need to start by considering some properties of the diagonal elements of the hat-matrix, $h_{ii}$. They are called leverage, because they measure how much pull a particular value of $x_i$ exerts on the regression line. Note that

$$\hat y_i = h_i'y = h_{ii}\, y_i + \sum_{j\neq i} h_{ij}\, y_j$$

so if $h_{ii}$ is particularly large, the $i$-th observation will have a large influence (or "leverage") on the predicted value. In a bivariate regression

$$h_{ii} = \frac{1}{N} + \frac{(x_i - \bar x)^2}{\sum_j (x_j - \bar x)^2}$$

so the leverage is related to how far the $x$-value of a data point is from the center of the data, compared to the general dispersion in the sample. High leverage points are outliers in the $x$-dimension. Figure 1 illustrates how high leverage points lead to small residuals, because such points can pull the regression line a lot without changing the residuals on the other (lower leverage) data points much.

How much leverage is a lot? Notice that $\sum h_{ii} = \mathrm{trace}(H) = \mathrm{rank}(H) = k$, the number of regressors, since $H$ is an idempotent matrix. Hence

$$\frac{1}{N}\sum_i h_{ii} = \frac{k}{N}.$$

Moreover, $h_{ii} < 1$, and as $h_{ii} \to 1$ the variance of the $i$-th residual would shrink to zero, i.e. the regression line would pass exactly through that point.

Armed with what we now know about $h_{ii}$, we can return to the bias formula (3) for the robust covariance estimator. This formula highlights two things. First, the bias depends on the form of $\Omega$, i.e. the actual variances of the population residuals, which is in general unknown. If we knew $\Omega$, we could compute the correct standard errors using (1) directly and there would be no need to resort to the robust covariance matrix. If we do not know $\Omega$, there is no way of knowing the exact extent of the bias of the robust covariance matrix. The second ingredient in the bias are the vectors $h_i$ from the projection or hat matrix.

[Figure 1: High leverage points lead to small residuals]

So the second thing we learn from (3) is that the bias will be worse if there are large $x$-outliers in our data, and in particular when the leverage of an observation is related to the variance of the residual.

What can be done to improve the performance of the robust covariance matrix? There are a number of suggestions in the literature. Denoting the robust covariance matrix estimator by

$$\left(\sum [X_i X_i']/N\right)^{-1} \left(\sum [X_i X_i' \hat\sigma_i^2]/N\right) \left(\sum [X_i X_i']/N\right)^{-1}$$

the alternative forms use alternative values for $\hat\sigma_i^2$:

$$\mathrm{HC}_0:\ \hat\sigma_i^2 = \hat\varepsilon_i^2$$
$$\mathrm{HC}_1:\ \hat\sigma_i^2 = \frac{N}{N-k}\,\hat\varepsilon_i^2$$
$$\mathrm{HC}_2:\ \hat\sigma_i^2 = \frac{\hat\varepsilon_i^2}{1 - h_{ii}}$$
$$\mathrm{HC}_3:\ \hat\sigma_i^2 = \frac{\hat\varepsilon_i^2}{(1 - h_{ii})^2}.$$

$\mathrm{HC}_0$ yields the covariance estimator suggested by White (1980). $\mathrm{HC}_1$ is a

simple degrees of freedom correction, which helps in small samples. $\mathrm{HC}_2$ uses the leverage to give an unbiased estimate of the variance of the $i$-th residual. $\mathrm{HC}_3$ is an approximation to a jackknife estimator suggested by MacKinnon and White (1985). In many cases the calculated standard errors from $\mathrm{HC}_j$ will be larger the larger is $j$, but there is no guarantee that the ordering takes that particular form with actual data. These alternative estimators are often implemented in modern regression packages.¹ Even when they are not, they are easy to compute using a trick suggested by Messer and White (1984). This amounts to dividing $y_i$ and $X_i$ by $\hat\sigma_i$ and then running an IV regression with these transformed variables, instrumenting $X_i/\hat\sigma_i$ by $X_i\hat\sigma_i$, for the appropriate choice of $\hat\sigma_i^2$.

In order to gain some insight into these various versions of the robust covariance estimator, consider a very simple regression design:

$$y_i = \alpha + \beta d_i + \varepsilon_i \qquad (4)$$

where $d_i$ is a dummy variable. $\beta$ in this regression estimates the difference in the means of the two subsamples defined by the dummy variable. Denoting these subsamples by the subscripts 0 and 1, we have

$$\hat\beta = \bar y_1 - \bar y_0.$$

Furthermore, let $p = E(d_i)$. We will treat the dummy as a fixed covariate, so that $p = N_1/N$ and $1-p = N_0/N$. We discuss this example because it is an important one in statistics, and we know a lot about the small sample properties of the difference in means. When $y_i$ is distributed normally with equal but unknown variance in the two subsamples, the t-statistic for the difference in means has a t-distribution: this is the classic two sample t-test. However, we are concerned with the possibility that there is heteroskedasticity, meaning that the variances in the two subsamples are different. If nothing is known about these two variances, the testing problem in small samples becomes intractable: the exact small sample distribution for this problem is not known. This is known as the Behrens-Fisher problem (see e.g. DeGroot, 1986). The different robust covariance estimators $\mathrm{HC}_0$-$\mathrm{HC}_3$ are different responses to figuring out the standard error for this testing problem.

¹ For example, the Stata package computes $\mathrm{HC}_1$, $\mathrm{HC}_2$, and $\mathrm{HC}_3$. Another suggestion to improve the small sample performance of the covariance estimator is bootstrapping. Horowitz (1997) advocates a form of the bootstrap called the wild bootstrap in this context.
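A compact way to compute all four variants is a single sandwich formula with different weights on the squared residuals. The following sketch (Python/numpy; the function name is ours, not from any package) implements $\mathrm{HC}_0$-$\mathrm{HC}_3$ as defined above:

```python
import numpy as np

def hc_covariances(X, y):
    """Sandwich covariance estimates HC0-HC3 for OLS (illustrative sketch)."""
    N, k = X.shape
    XtX_inv = np.linalg.inv(X.T @ X)
    e = y - X @ (XtX_inv @ X.T @ y)                # OLS residuals
    h = np.einsum('ij,jk,ik->i', X, XtX_inv, X)    # leverages h_ii
    weights = {
        'HC0': e**2,
        'HC1': N / (N - k) * e**2,
        'HC2': e**2 / (1 - h),
        'HC3': e**2 / (1 - h)**2,
    }
    # (X'X)^{-1} (sum_i w_i x_i x_i') (X'X)^{-1} for each weighting
    return {name: XtX_inv @ ((X.T * w) @ X) @ XtX_inv
            for name, w in weights.items()}
```

In the dummy-variable design (4), one can check that the $\mathrm{HC}_2$ entry for the slope reproduces exactly the unbiased two-sample formula $S_0^2/[N_0(N_0-1)] + S_1^2/[N_1(N_1-1)]$ derived below.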

Define $S_j^2 = \sum_{d_i = j} (y_i - \bar y_j)^2$ for $j = 0, 1$. The diagonal elements of the hat-matrix in this particular case are

$$h_{ii} = \begin{cases} 1/N_0 & \text{if } d_i = 0 \\ 1/N_1 & \text{if } d_i = 1 \end{cases}$$

and it is straightforward to show that the five covariance estimators of $\hat\beta$ are

$$\mathrm{OLS}:\ \frac{S_0^2 + S_1^2}{N-2}\cdot\frac{N}{N_0 N_1} = \frac{S_0^2 + S_1^2}{N-2}\cdot\frac{1}{Np(1-p)}$$
$$\mathrm{HC}_0:\ \frac{S_0^2}{N_0^2} + \frac{S_1^2}{N_1^2}$$
$$\mathrm{HC}_1:\ \frac{N}{N-2}\left[\frac{S_0^2}{N_0^2} + \frac{S_1^2}{N_1^2}\right]$$
$$\mathrm{HC}_2:\ \frac{S_0^2}{N_0(N_0-1)} + \frac{S_1^2}{N_1(N_1-1)}$$
$$\mathrm{HC}_3:\ \frac{S_0^2}{(N_0-1)^2} + \frac{S_1^2}{(N_1-1)^2}.$$

The standard OLS estimator pools the observations from both subsamples to derive the variance estimate: this is the efficient thing to do when the two variances are actually the same. The White (1980) estimator $\mathrm{HC}_0$ adds the estimates of the two sampling variances of the means, using the consistent (but biased) maximum likelihood estimate of the variance. The $\mathrm{HC}_2$ estimator is the unbiased estimator for the sampling variance in this case, since it makes the correct degrees of freedom correction. $\mathrm{HC}_1$ makes the degrees of freedom correction outside the sum, which will help but generally not be quite correct. Since we know $\mathrm{HC}_2$ to be the unbiased estimate of the sampling variance, we also see immediately that $\mathrm{HC}_3$ will be too big. Even though we know the exact unbiased estimator for the sampling variance in this case, we still don't know the small sample distribution of the test statistic

$$\frac{\bar y_1 - \bar y_0}{\sqrt{\dfrac{S_0^2}{N_0(N_0-1)} + \dfrac{S_1^2}{N_1(N_1-1)}}}\,;$$

this is the Behrens-Fisher problem. Note that $p = 0.5$ implies that the regression design is perfectly balanced. In this case the OLS estimator will be equal to $\mathrm{HC}_1$, and all five estimators will generally differ little.

To provide some further insights, we present some results from a small Monte Carlo experiment for the model (4). We choose $N = 30$, since

this will highlight the small sample issues, and $p = 0.9$, which implies $h_{ii} = 10/N = 1/3$ if $d_i = 0$, in order to have a relatively unbalanced design. We draw

$$\varepsilon_i \sim \begin{cases} N(0, \sigma^2) & \text{if } d_i = 0 \\ N(0, 1) & \text{if } d_i = 1 \end{cases}$$

and we show results for two cases. The first has relatively little heteroskedasticity, and we set $\sigma = 0.85$, while the second has lots of heteroskedasticity, with $\sigma = 0.5$.

Table 2 displays the results. The columns "mean" and "standard deviation" display means and standard deviations of the various estimators across 5,000 replications of the sampling experiment. The standard deviation of $\hat\beta$ is the sampling variability we are trying to measure. Even with little heteroskedasticity, the OLS standard errors are too small by about 15%. However, $\mathrm{HC}_0$ and $\mathrm{HC}_1$ are even smaller because of the small sample bias. $\mathrm{HC}_2$ is slightly bigger than the OLS standard errors on average. Notice that this estimator of the sampling variance is unbiased, while the mean of the $\mathrm{HC}_2$ standard errors across sampling experiments (0.54) is still below the standard deviation of $\hat\beta$ (0.60). This comes from the fact that the standard error is the square root of the sampling variance, the sampling variance is itself estimated and hence has sampling variability, and the square root is a concave function. The $\mathrm{HC}_3$ standard error is slightly too big, as we expected.

The last two columns in the table show empirical rejection rates for the hypothesis that $\beta$ equals its true value, using a nominal size of 5% for the test. Since we don't know the exact small sample distribution, we compare the test statistics to the normal distribution (which is the asymptotic distribution) and to a t-distribution (which is not the correct small sample distribution in this case for any of the estimators, as we have seen). Rejection rates are far too high for all tests. Interestingly, with little heteroskedasticity, OLS standard errors have lower rejection rates than the robust standard errors, even though the standard errors themselves are smaller than $\mathrm{HC}_2$ and $\mathrm{HC}_3$ on average. But the standard errors themselves are estimated and have sampling variability. The OLS standard errors are much more precisely estimated than the robust standard errors, as can be seen from column (2).³ This means the robust standard errors will sometimes be too small by accident, and this happens often enough in this case to make the OLS

³ The large sampling variance of the robust estimators has also been noted by Chesher and Austin (1991). Kauermann and Carroll (2001) propose an adjustment to the confidence interval to correct for this.

standard errors preferred. The lesson we can take away from this is that robust standard errors are no panacea. They can be smaller than OLS standard errors for two reasons: the small sample bias we have discussed, and the higher sampling variance of these standard errors. Hence, if we observe robust standard errors being smaller than OLS standard errors, we should take this as a warning flag: if heteroskedasticity were present and our standard error estimate were about right, this wouldn't happen. With lots of heteroskedasticity, as in the lower panel of the table, things are different: OLS standard errors are now dominated by the robust standard errors throughout, although the empirical rejection rates are still way too high to give us much confidence in our confidence intervals for any of the estimators. Using the t-distribution rather than the normal helps only marginally.

There doesn't seem to be any clear way out of this conundrum. Standard error estimates might be biased in finite samples: OLS standard errors because of heteroskedasticity, and robust standard errors because of the influence of high leverage points. Hence, if the regression design is possibly unbalanced, the only prescription for the applied researcher is to check the data for high leverage points. If the regression design is relatively balanced, then robust standard errors should produce fairly accurate confidence intervals even in small samples.

One hopeful observation on robust standard errors is that we have rarely seen them differ from OLS standard errors in empirical practice by more than something like 5%. In any applied project there are always myriad specification choices to be made, from selection of the sample, to the exact treatment of the variables, regression design, etc. These certainly produce non-sampling variation in our estimates of a similar magnitude (in the sense that our estimates would differ if we repeated the project with slightly different choices). Hence, although we strive to get our standard errors as right as possible, if they end up being biased by something on the order of 5%, this would probably not keep us up at night. But this is only true in the case of independent observations. Things can be much worse when observations are dependent.

1.2 Clustering and Serial Correlation in Panels

1.2.1 Clustering and the Moulton Factor

The more serious problems have to do with correlation of the residuals across the units of observation. Start by considering the simple model

$$y_{ig} = \alpha + \beta x_g + \varepsilon_{ig} \qquad (5)$$

where the outcome is observed at the individual level but the regressor of interest, $x_g$, varies only at a higher level of aggregation, a group $g$, and there are $G$ groups. For example, $y_{ig}$ could be the test score of a student and $x_g$ class size, where $i$ denotes the student and $g$ the class room. If $x_g$ is randomly assigned, as in the Tennessee STAR experiment (Krueger, 1999), then the OLS estimator is unbiased and consistent for the population regression coefficient. Recall that the students in kindergarten to grade 3 were randomly assigned to small or regular classes in the STAR experiment. What we are worried about in an analysis of the STAR data is that the error term has a group structure:

$$\varepsilon_{ig} = v_g + \eta_{ig}. \qquad (6)$$

The class room level component could result from the fact that a class may have had a particularly good teacher, or a class took the test when there were a lot of external disruptions, so that all students performed more poorly than in other classes. This problem of correlation in the errors is, of course, well known in econometrics. Kloek (1981) and in particular Moulton (1986), however, pointed out how important it can be for applied research in the grouped regressor case. Following their derivations, it is straightforward to analyze this case. The algebra needs some extra notation, and is therefore exiled to an appendix. Let

$$\rho = \frac{\sigma_v^2}{\sigma_v^2 + \sigma_\eta^2}.$$

Given the structure (6), $\rho$ is called the intra-class correlation (even in cases where the groups are not class rooms!). When the groups are of equal size $n$, we have

$$\frac{var(\hat\beta)}{var_c(\hat\beta)} = 1 + (n-1)\rho \qquad (7)$$

where $var(\hat\beta)$ is the true variance of the OLS estimator and $var_c(\hat\beta)$ is the conventional OLS variance. Notice that the OLS standard error formula will be worse if $n$ is large and if $\rho$ is large. To see the intuition, consider the case where $\rho \to 1$. In this case, all the errors within a group are the same. This is just like taking a data set and making $n$ identical copies. The covariance matrix of the replicated data set is going to be $1/n$ times the original covariance matrix, although no information has been added.

In order to see how this problem is related to the group structure in the regressor $x$, consider the generalization of (7) where the regressor is $x_{ig}$,

which varies at the individual level but is correlated within groups, and the group sizes $n_g$ vary by group. In this case

$$\frac{var(\hat\beta)}{var_c(\hat\beta)} = 1 + \left[\frac{var(n_g)}{\bar n} + \bar n - 1\right]\rho_x\,\rho \qquad (8)$$

where

$$\rho_x = \frac{\sum_g \sum_{i\neq k}(x_{ig} - \bar x)(x_{kg} - \bar x)}{var(x_{ig})\sum_g n_g(n_g - 1)}$$

is the intra-class correlation of $x_{ig}$; it is actually unrestricted and does not impose a structure like (6). What the formula says is that the bias in the OLS formula is much worse when $\rho_x$ is large, but vanishes when $\rho_x = 0$: if the $x_{ig}$s are uncorrelated within groups, the error structure does not matter for the estimation of the standard errors.

In order to see that this problem can be quite important, return to the example of estimating the effect of class size on student achievement with the Tennessee STAR data. For illustration, we will just run (5) by OLS, although a fair bit of the variation in class size comes from non-random factors. A simple regression of the percentile score for kindergartners on their class size yields an estimate with a small robust standard error. Now consider the formula (8). Even though $\rho_x = 1$, classes are of unequal size. Plugging the relevant values ($\bar n = 19.4$ and $\rho = 0.311$) into the formula we get

$$\frac{var(\hat\beta)}{var_c(\hat\beta)} = 7.01.$$

This implies that our standard error estimate is too small by a factor of $2.65 = \sqrt{7.01}$, and the corrected standard error is larger by the same factor.⁴

The same problem arises in IV estimation. Consider the regression equation

$$y_{ig} = \alpha + \beta x_{ig} + \varepsilon_{ig}$$

where the regressor can now vary at the individual level. Let $Z$ be a matrix of instruments which only vary at the group level. It is easy to show that the Moulton formula for the IV case is the same as (8) for the grouping of the instrument (Shore-Sheppard, 1996). Hence it is equally important to address this problem at the level of an instrumental variable as it is for a regressor in the OLS case.
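For quick diagnostics, formula (8) can be wrapped in a small function. This sketch (Python/numpy; the function name and the population-variance convention for $var(n_g)$ are our choices) returns the ratio of the true to the conventional OLS variance:

```python
import numpy as np

def moulton_factor(n_g, rho_x, rho):
    """Ratio of true to conventional OLS variance, as in (8).

    n_g: array of group sizes; rho_x: intra-class correlation of the
    regressor; rho: intra-class correlation of the errors.
    """
    n_g = np.asarray(n_g, dtype=float)
    n_bar = n_g.mean()
    return 1.0 + (n_g.var() / n_bar + n_bar - 1.0) * rho_x * rho
```

With equal group sizes and $\rho_x = 1$, this reduces to the simpler factor $1 + (n-1)\rho$ in (7); with $\rho_x = 0$ it equals 1, reflecting that an uncorrelated regressor makes the error structure irrelevant for the standard errors.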
Another setting where this problem might pop up is the regression discontinuity design, if the confounder, $x_i$, is measured at a group level and not the individual level (see Card and Lee, 2007).

⁴ The IV coefficient estimate, where class size is instrumented with two dummies for the assignment to regular and regular-with-aide groups, is almost identical.
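Before turning to solutions, a quick Monte Carlo makes the problem concrete. The sketch below (Python/numpy; all parameter values are ours, for illustration only) simulates model (5) with the error structure (6) and compares the conventional OLS standard error to the actual sampling variability of $\hat\beta$:

```python
import numpy as np

rng = np.random.default_rng(0)
G, n, rho = 50, 20, 0.5                        # groups, group size, intra-class corr.
sd_v, sd_e = np.sqrt(rho), np.sqrt(1 - rho)    # total error variance normalized to 1

def one_sample():
    x = np.repeat(rng.normal(size=G), n)         # regressor varies only by group
    v = np.repeat(sd_v * rng.normal(size=G), n)  # group error component v_g
    y = 1.0 + 0.5 * x + v + sd_e * rng.normal(size=G * n)
    X = np.column_stack([np.ones(G * n), x])
    XtX_inv = np.linalg.inv(X.T @ X)
    b = XtX_inv @ X.T @ y
    e = y - X @ b
    s2 = e @ e / (G * n - 2)
    return b[1], np.sqrt(s2 * XtX_inv[1, 1])     # slope and conventional OLS SE

draws = np.array([one_sample() for _ in range(500)])
ratio = draws[:, 0].std() / draws[:, 1].mean()   # true SD relative to reported SE
```

With these parameters the true sampling standard deviation comes out roughly three times the conventional standard error, in line with the Moulton factor $\sqrt{1 + (n-1)\rho} = \sqrt{10.5} \approx 3.2$ from (7).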

There are various solutions to this problem:

1. Parametric correction: Obtain an estimate of $\rho$ and calculate the standard errors using the correct formula given by (8). The intra-class correlations $\rho$ and $\rho_x$ can typically be estimated easily in statistical software.⁵

2. Clustered standard errors: A non-parametric correction for the standard errors is given by the following extension of the robust covariance matrix (Liang and Zeger, 1986):

$$var(\hat\beta) = (X'X)^{-1}\left(\sum_g X_g'\hat\Psi_g X_g\right)(X'X)^{-1} \qquad (9)$$

$$\hat\Psi_g = q\,\hat\varepsilon_g\hat\varepsilon_g' = q\begin{bmatrix} \hat\varepsilon_{1g}^2 & \hat\varepsilon_{1g}\hat\varepsilon_{2g} & \cdots & \hat\varepsilon_{1g}\hat\varepsilon_{n_g g} \\ \hat\varepsilon_{1g}\hat\varepsilon_{2g} & \hat\varepsilon_{2g}^2 & & \vdots \\ \vdots & & \ddots & \hat\varepsilon_{(n_g-1)g}\hat\varepsilon_{n_g g} \\ \hat\varepsilon_{1g}\hat\varepsilon_{n_g g} & \cdots & \hat\varepsilon_{(n_g-1)g}\hat\varepsilon_{n_g g} & \hat\varepsilon_{n_g g}^2 \end{bmatrix}$$

where $X_g$ is the matrix of regressors for group $g$, and $q$ is a degrees of freedom adjustment factor like $G/(G-1)$, similar to the one in $\mathrm{HC}_1$ for the simple heteroskedasticity robust covariance matrix above. This calculation of the covariance matrix allows for arbitrary correlation of the errors within the clusters $g$, not just the structure in (6). Clustered standard errors will be consistent as the number of groups gets large.

3. Aggregation to the group level: Calculate $\bar y_g$ first and then run a weighted least squares regression

$$\bar y_g = \alpha + \beta x_g + \bar\varepsilon_g$$

with the number of observations in the group as weights (or the inverse of the sampling variance of $\bar y_g$). For the correct choice of weights this is equivalent to doing OLS on the micro data. The error term at this aggregated level is $\bar\varepsilon_g = v_g + \bar\eta_g$, and the error component $v_g$ is therefore reflected in the usual second step standard errors, so that inference can be based directly on the second step covariance matrix.⁶

⁵ For example, using the loneway command in Stata.
⁶ See Wooldridge (2003) and Donald and Lang (2007). While the aggregate regression is simply the between regression in the context of a random effects model, long known to econometricians, the first discussion of the analogy of the micro and group level regressions and its relationship to inference is probably in Kloek (1981).

If there are other micro level regressors in the model, as in

$$y_{ig} = \alpha + \beta x_g + \gamma' W_{ig} + \varepsilon_{ig},$$

we can do the aggregation by running the regression

$$y_{ig} = \alpha_g^0 + \gamma' W_{ig} + \varepsilon_{ig}^0$$

which includes a full set of group dummies. The $\hat\alpha_g^0$ coefficients on the group dummies are our group means, purged of the effect of the individual level variables $W_{ig}$. Obviously, aggregation does not work when $x_{ig}$ varies within group: averaging the $x_{ig}$s to group means is IV, and hence involves changing the estimator.

4. Block bootstrap: Bootstrapping means drawing random samples from the empirical distribution of the data. Since the best representation of the empirical distribution of the data is the data itself, this means in practice, for a sample of size $N$, drawing another sample of size $N$ with replacement from the original data set. This can be done many times, and an estimate is computed in each of the bootstrap samples. The standard error of the estimate is the standard deviation of the estimates across all the bootstrap samples. In block bootstrapping, the bootstrap draws are whole blocks of data as defined by the groups $g$. Hence, any correlation of the errors within a block will be kept intact by the block bootstrap sampling, and should therefore be reflected in the standard error estimate. There are many different ways to do bootstrap inference; for more on this see Cameron, Gelbach, and Miller (2006).

5. Estimate a random effects GLS or ML model of equation (5). This relies on the linearity of the CEF, and since we prefer the simple OLS approximation to the conditional expectation function, we do not recommend this approach.

Table 8.2.1 returns to the class size example from the STAR experiment, which we have discussed in this section.
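The clustered covariance matrix in (9) is also easy to compute directly. Here is a minimal sketch (Python/numpy; the function name is ours, and in practice one would use the canned routines in a regression package):

```python
import numpy as np

def cluster_cov(X, y, groups):
    """Cluster-robust covariance matrix as in (9), with a G/(G-1) adjustment."""
    X = np.asarray(X, dtype=float)
    N, k = X.shape
    XtX_inv = np.linalg.inv(X.T @ X)
    e = y - X @ (XtX_inv @ X.T @ y)       # OLS residuals
    labels = np.unique(groups)
    G = len(labels)
    meat = np.zeros((k, k))
    for g in labels:
        idx = groups == g
        s = X[idx].T @ e[idx]             # within-cluster score sum X_g' e_g
        meat += np.outer(s, s)            # equals X_g' e_g e_g' X_g
    q = G / (G - 1)                       # simple degrees-of-freedom factor
    return q * XtX_inv @ meat @ XtX_inv
```

With every observation in its own cluster, this collapses (up to the $q$ factor) to the $\mathrm{HC}_0$ estimator, which makes clear that clustering is the natural generalization of the robust covariance matrix.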
The table presents six different estimates of the standard errors: conventional robust standard errors (using $\mathrm{HC}_1$); two versions of the parametrically corrected standard errors using the Moulton formula (8), the first using the formula for the intra-class correlation given by Moulton and the second using an alternative ANOVA estimator of the intra-class correlation;⁷ clustered standard errors; block bootstrapped

⁷ This is computed using the loneway command in Stata.

standard errors; and estimates aggregated to the group level.

Columns (1) and (2) present the results for the class size regressor, while columns (3) and (4) show the estimates for an individual level covariate we have included in the regression: sex. Class rooms are almost balanced by sex, and hence there is (almost) no intra-class correlation in this regressor. As a result, the standard error estimates for this regressor are not affected by any of the corrections. As we have seen before, the adjustments to the standard error on the class size regressor are large, but the different adjustments deliver standard errors that are almost identical to each other. There are 318 class rooms in the data set, which is a large number, and all these methods should deliver similar results with a large number of clusters. We tend to use clustered standard errors in practice, because they are conveniently available in many regression packages and hence easy to compute.⁸ The aggregation approach also has much to commend itself, if only because it often allows one to plot the data easily in the second stage. With a small number of groups there is a new set of concerns to worry about, and we turn to this in section 1.2.3 below.

1.2.2 Serial Correlation in Panels and Difference-in-Differences Models

Now suppose that there are only two groups, i.e. the regressor of interest is a dummy variable:

$$y_{ig} = \alpha + \beta d_g + v_g + \eta_{ig}. \qquad (10)$$

The Moulton problem does not arise in this case, because OLS fits the regression line perfectly through the two points defined by the dummy variable. To see this, notice that

$$E(y_{ig}|d_g) = \alpha + \beta d_g + E(v_g|d_g)$$

so that

$$\hat\beta = E(y_{ig}|d_g = 1) - E(y_{ig}|d_g = 0) = \beta + E(v_g|d_g = 1) - E(v_g|d_g = 0).$$

Since $E(v_g|d_g) = 0$, this means that the estimate of $\beta$ will be unbiased, but it will not be consistent, as pointed out by Donald and Lang (2007). In

⁸ In fact, the name "clustered standard errors," which applied researchers have adopted, derives from the name of the Stata option.

every new sample there will be a new draw of $v_g$. So the regression line will be somewhat off, and the estimate will not exactly equal the population $\beta$. However, on average there will be no bias: sometimes $\beta$ will be overestimated, sometimes underestimated. Now suppose we let $N \to \infty$, while $G$, the number of groups, remains constant at 2. The bias that exists in any particular sample will not go to zero, because $v_g$ is just as important in the big sample as in the small sample. Only the sampling variation due to $\eta_{ig}$ will vanish, not the sampling variation due to $v_g$. In a sense, the Moulton problem discussed above arises precisely from the fact that the regression line will not neatly fit through all the points defined by the $v_g$ when there are three or more groups.

This problem also arises in the standard 2x2 difference-in-differences model. Recall from our earlier discussion of differences-in-differences that we modeled the outcome as an additive function of a state effect, a time effect, and the treatment effect on the interaction: $E[y_{ist}|s,t] = \gamma_s + \lambda_t + \beta d_{st}$. Now consider the case where there is a state-time specific component to the error term:

$$y_{ist} = \gamma_s + \lambda_t + \beta d_{st} + v_{st} + \eta_{ist}. \qquad (11)$$

Because the model is saturated, this is no different from the model (10) for the purpose of inference. As before, the error component $v_{st}$ does not vanish even when $N \to \infty$, i.e. when the group sizes are large. Moreover, there is really no way to get consistent standard errors which acknowledge this problem, because $d_{st}$ and $v_{st}$ are completely collinear, so no separate estimate of $\beta$ and $v_{st}$ is possible. This means that 2x2 differences-in-differences are not really very informative if $v_{st}$ shocks are important.

An example of this problem is the basic analysis of the employment effects of the New Jersey minimum wage in the original Card and Krueger (1994) New Jersey-Pennsylvania comparison. Card and Krueger compared employment at New Jersey and Pennsylvania fast food restaurants before and after New Jersey introduced a state minimum wage. With two states and two periods this is the standard 2x2 DD design. The solution to this problem is to have either multiple time periods on two states, as in the Card and Krueger (2000) reanalysis of the New Jersey-Pennsylvania experiment with a longer time series of payroll data, or multiple contrasts for two time periods, as in Card (1992) using 51 states. It is straightforward to get correct standard errors if $v_{st}$ is iid by using one of the methods discussed in the previous section.

In many applications of the difference-in-differences model there will be both multiple treatment groups ($s$) and multiple time periods ($t$). Bertrand,

Duflo, and Mullainathan (2004) and Kézdi (2004) point out a further problem in this case. Many economic variables of interest tend to be correlated over time. This means that $v_{st}$ is most likely serially correlated. Consider the Card and Krueger example again, and imagine using the data from Card and Krueger (2000), which span the period from October 1991 to September 1997. This yields 72 monthly observations for each state. But these 72 observations are not independent: employment variations tend to be highly correlated over time. For example, we saw in the DD notes before that employment in Pennsylvania was consistently lower than in New Jersey for most of 1994 and 1995. Hence, the solutions which treat $v_{st}$ as iid are not sufficient. Bertrand et al. (2004) investigate a variety of remedies, like clustering at the state level, block bootstrap methods at the state level, ignoring the time series information by aggregating the data into two periods, or parametric modeling of the serial correlation.

An interesting and important result is that clustering standard errors at the state level solves the serial correlation problem. In the previous section we would have treated the state*month cell as the cluster, because the variation in the key regressor is at the state*month level. Instead, treat the entire state as the cluster. This might seem odd at first glance, since we have already controlled for state effects: the state dummy $\gamma_s$ in (11) already removes the time mean of $v_{st}$, which is $\bar v_s$. Nevertheless, this method solves the serial correlation problem, because $v_{st} - \bar v_s$ will still be correlated for adjacent periods. Clustering at the state*month level does not address this, because residuals across clusters are treated as independent. But clustering at the state level allows for it, since this covariance estimator allows a completely non-parametric residual correlation within clusters. Clustered standard errors serve a very different role here than in the standard Moulton case (5), but they work.

The conclusion is that correlated errors are likely to be a problem in many panel type applications, and adjusting the standard errors for this correlation is important. Donald and Lang (2004), Bertrand et al. (2004), and Kézdi (2004) highlight the issue that we may want to treat standard errors as clustered at a high level of aggregation. As a result we may end up with relatively few clusters.

1.2.3 Few Clusters

The problem of few clusters is the analogue to the small sample problem of robust standard errors discussed in section 1.1. Here as there, small sample distributions for the different estimators are not available, but we know (from

Monte Carlo evidence) that all the adjustments can be biased substantially when there are only few clusters. Donald and Lang (2007) and Cameron, Gelbach, and Miller (2006) discuss inference when the resulting number of groups is small; see also Hansen (2007b). This area is very much research in progress, and firm recommendations are therefore difficult. The main approaches are:

1. Bias corrections of clustered standard errors. Clustered standard errors are biased in small samples because

$$E\left[\hat\varepsilon_g\hat\varepsilon_g'\right] \neq E\left[\varepsilon_g\varepsilon_g'\right] = \Psi_g$$

just as in section 1.1. One solution to the bias problem is to use an adjustment, just as in section 1.1, to correct for the small sample bias. As before, the bias depends on the form of $\Psi_g$. Bell and McCaffrey (2002) suggest adjusting the residuals by

$$\tilde\varepsilon_g = A_g\hat\varepsilon_g, \qquad \hat\Psi_g = q\,\tilde\varepsilon_g\tilde\varepsilon_g'$$

where $A_g$ solves

$$A_g'A_g = (I - H_g)^{-1}$$

and

$$H_g = X_g(X'X)^{-1}X_g'$$

is the hat-matrix at the group level. This is the analog of $\mathrm{HC}_2$ for the clustered case. However, the matrix $A_g$ is not unique; there are many such decompositions. Bell and McCaffrey (2002) suggest using the symmetric square root of $(I - H_g)^{-1}$, i.e. $A_g = P\Lambda^{1/2}P'$, where $P$ is the matrix of eigenvectors of $(I - H_g)^{-1}$, $\Lambda$ is the diagonal matrix of the corresponding eigenvalues, and $\Lambda^{1/2}$ is the diagonal matrix of the square roots of the eigenvalues. One problem with the Bell and McCaffrey adjustment is that $(I - H_g)$ may not be of full rank, and hence the inverse may not exist for all designs. This happens, for example, when one of the regressors is a dummy variable which only takes on the value zero or only the value one within each group. In addition, the dimension of $H_g$ is the number of observations per group. Since this matrix needs to be inverted, this only tends to work if the group sizes are reasonably small.

2. Various authors, including Bell and McCaffrey (2002) and Donald and Lang (2007), suggest to base inference on a t-distribution with $G - k$ degrees of freedom, where $G$ is the number of groups and $k$ is the number of regressors, rather than on the standard normal distribution. We have seen in Section 1.1 that this is not generally the correct small sample distribution even if the errors $v_g$ are normally distributed. Nevertheless, for small $G$ this makes a substantial difference. Cameron, Gelbach, and Miller (2006) find that this works well in conjunction with the Bell and McCaffrey (2002) bias correction as described in 1 for the Moulton problem.

3. Donald and Lang (2007) suggest that aggregating to the group level works well even with a small number of groups, in conjunction with using a t-distribution with $G - k$ degrees of freedom. Straight aggregation does not work to solve the serial correlation problem in panels discussed in Bertrand et al. (2004) and Kézdi (2004).

4. Cameron, Gelbach, and Miller (2006) report that various forms of the bootstrap work well with small numbers of groups, and typically outperform clustered standard errors without the bias correction. They point out, however, that the bootstrap does not always lead to improved small sample statistics. In order to get such an improvement they suggest to bootstrap Wald statistics directly, rather than obtain the test statistic based on bootstrapped standard errors. They also recommend a method called the wild bootstrap. Rather than resampling entire groups $(y_g, X_g)$ of data, this involves computing a new $y_g^*$ based on the residual $\hat{\varepsilon}_g = y_g - X_g\hat{\beta}$, where $y_g^* = X_g\hat{\beta} + \varepsilon_g^*$. This implies that the $X_g$'s are being kept fixed and only a new residual $\varepsilon_g^*$ is chosen in each bootstrap replication. In the wild bootstrap, $\varepsilon_g^* = \hat{\varepsilon}_g$ with probability 0.5 and $\varepsilon_g^* = -\hat{\varepsilon}_g$ with probability 0.5.

5. Hansen (2007a) proposes parametric methods to solve the serial correlation problem discussed above, i.e. model the error process as an AR process, estimate the AR parameters, and fix the covariance matrix. Hansen points out that the AR parameters are biased in panels of short duration, and demonstrates the importance of using a bias-adjusted estimator for these coefficients. His methods seem to yield much improved inference compared to Bertrand et al.'s (2004) investigation of parametric models without bias adjustment. Although Hansen (2007a) does not explicitly demonstrate the performance of his estimator with a small number of groups, one would generally expect more parametric methods to be less sensitive to sample size than nonparametric ones, like clustering, as long as the parametric assumptions are roughly right.

Various authors have demonstrated that ignoring the problem of a small number of clusters can lead to very misleading inference. Nevertheless, there seems to be no single fix which solves this problem satisfactorily. We have seen above, in the case of the simple robust covariance matrix in section 1.1, that fixing the bias problem tends to introduce variance into the covariance estimator. Hence, trying to fix the bias may sometimes lead to smaller standard errors than ignoring the problem. Whether the few clusters problem leads to a lot of bias also seems to depend on the situation. For example, Monte Carlo results in Hansen (2007b) suggest that the bias is a lot worse for the standard Moulton problem than for the serial correlation problem. This suggests that it may be feasible to stick with regular clustered standard errors to solve serial correlation in panels, even when the number of clusters is as small as 10. For solving the Moulton problem in section 1.2.1, it seems more important to worry about clustered standard errors with a small number of clusters. However, the Donald and Lang (2007) aggregation seems to work well in this case as long as the regressor of interest is fixed within groups. This would also be our preferred strategy when both problems occur in combination, i.e. when the estimation is based on micro data, but treatment only varies at the state (or some other aggregate) level over time. In this case, aggregate the observations first to the state-year level, and then cluster standard errors in the aggregate panel at the state level.

The upshot from all this is that it may be important to pay attention to small sample issues in applied microeconometric work. Working with large micro data sets, we used to sneer at the macro economists with their small time series samples.
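The wild cluster bootstrap of point 4 can be sketched in a few lines. This is our own minimal illustration, not code from the notes: residual signs are flipped at the group level with probability 0.5 (keeping the $X_g$'s fixed), and the t-statistic itself is bootstrapped rather than just the standard error.

```python
import numpy as np

def clustered_se(X, resid, clusters):
    """Plain clustered (sandwich) standard errors for all coefficients."""
    XtX_inv = np.linalg.inv(X.T @ X)
    meat = np.zeros((X.shape[1], X.shape[1]))
    for g in np.unique(clusters):
        s = X[clusters == g].T @ resid[clusters == g]
        meat += np.outer(s, s)
    return np.sqrt(np.diag(XtX_inv @ meat @ XtX_inv))

def wild_cluster_bootstrap_t(y, X, clusters, coef=1, reps=499, seed=0):
    """Bootstrap distribution of the t-statistic, resampling residual signs by cluster."""
    rng = np.random.default_rng(seed)
    beta = np.linalg.lstsq(X, y, rcond=None)[0]
    resid = y - X @ beta
    groups = np.unique(clusters)
    t_stars = np.empty(reps)
    for r in range(reps):
        flips = rng.choice([-1.0, 1.0], size=groups.size)    # Rademacher weights
        e_star = resid * flips[np.searchsorted(groups, clusters)]
        y_star = X @ beta + e_star                           # X_g's held fixed
        b_star = np.linalg.lstsq(X, y_star, rcond=None)[0]
        r_star = y_star - X @ b_star
        se = clustered_se(X, r_star, clusters)[coef]
        t_stars[r] = (b_star[coef] - beta[coef]) / se
    return t_stars
```

In use, one compares the sample t-statistic to the quantiles of the returned array instead of to normal critical values. Cameron, Gelbach, and Miller (2006) consider further refinements, such as imposing the null hypothesis when forming the residuals.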
But he who laughs last laughs best: it turns out that it is the macro economists who had it right all along, and we micro economists are now often confined to the same small sample sizes as they are. The key is to think about where your variation lives. Unfortunately, all too often it lives at a fairly aggregate level. Which methods work best in particular applications when the original data result from large micro data samples is still an open question, and this remains an area of active research.

References

Bell and McCaffrey (2002) "Bias Reduction in Standard Errors for Linear Regression with Multi-stage Samples," Survey Methodology, 2002.

Card, David, and David Lee (2007) "Regression Discontinuity Inference with Specification Error," Journal of Econometrics, 2007.

Chesher, Andrew and Ian Jewitt (1987) "The Bias of the Heteroskedasticity Consistent Covariance Estimator," Econometrica, 55.

Chesher, Andrew and Gerald Austin (1991) "The Finite-Sample Distributions of Heteroskedasticity Robust Wald Statistics," Journal of Econometrics, 47.

DeGroot, Morris (1986) Probability and Statistics, 2nd edition, Reading: Addison-Wesley.

Hansen, Christian (2007a) "Generalized Least Squares Inference in Multilevel Models with Serial Correlation and Fixed Effects," Journal of Econometrics.

Hansen, Christian (2007b) "Asymptotic Properties of a Robust Variance Matrix Estimator for Panel Data when T is Large," Journal of Econometrics.

Kauermann, Göran and Raymond J. Carroll (2001) "A Note on the Efficiency of Sandwich Covariance Estimation," JASA, 96.

Moulton, Brent (1986) "Random Group Effects and the Precision of Regression Estimates," Journal of Econometrics, 32.

Shore-Sheppard, L. (1996) "The Precision of Instrumental Variables Estimates with Grouped Data," Industrial Relations Section Working Paper #374, Princeton University.

Bertrand, Marianne, Esther Duflo, and Sendhil Mullainathan (2004) "How Much Should We Trust Differences-in-Differences Estimates?" Quarterly Journal of Economics, 119, February 2004.

Liang, K. and Scott L. Zeger (1986) "Longitudinal Data Analysis Using Generalized Linear Models," Biometrika, 73, 13-22.

Cameron, Colin, Jonah Gelbach, and Douglas L. Miller (2006) "Bootstrap-Based Improvements for Inference with Clustered Errors," mimeographed.

Kloek, T. (1981) "OLS Estimation in a Model Where a Microvariable is Explained by Aggregates and Contemporaneous Disturbances are Equicorrelated," Econometrica, 49, No. 1, January 1981.

Davidson and MacKinnon (1993) Estimation and Inference in Econometrics, New York and Oxford: Oxford University Press.

Donald, Stephen G. and Kevin Lang (2007) "Inference with Difference-in-Differences and Other Panel Data," Review of Economics and Statistics, May 2007, 89, No. 2.

Horowitz, Joel L. (1997) "Bootstrap Methods in Econometrics: Theory and Numerical Performance," in: Kreps and Wallis (eds.) Advances in Economics and Econometrics: Theory and Applications, Seventh World Congress, vol. III, Cambridge: Cambridge University Press.

MacKinnon and White (1985) "Some Heteroskedasticity Consistent Covariance Matrix Estimators with Improved Finite Sample Properties," Journal of Econometrics, 29.

Messer and White (1984) "A Note on Computing the Heteroskedasticity Consistent Covariance Matrix Using Instrumental Variables Techniques," Oxford Bulletin of Economics and Statistics, 46.

Kézdi (2004) "Robust Standard Error Estimation in Fixed-Effects Panel Models," Hungarian Statistical Review, Special English Volume #9, 2004.

Wooldridge (2003) "Cluster-Sample Methods in Applied Econometrics," American Economic Review, May 2003, 93, Iss. 2.

Appendix

In order to derive (7), write
$$
y_g = \begin{pmatrix} y_{1g} \\ y_{2g} \\ \vdots \\ y_{n_g g} \end{pmatrix}, \qquad
\varepsilon_g = \begin{pmatrix} \varepsilon_{1g} \\ \varepsilon_{2g} \\ \vdots \\ \varepsilon_{n_g g} \end{pmatrix}
$$
and
$$
y = \begin{pmatrix} y_1 \\ y_2 \\ \vdots \\ y_G \end{pmatrix}, \qquad
x = \begin{pmatrix} x_1 \\ x_2 \\ \vdots \\ x_G \end{pmatrix}, \qquad
\varepsilon = \begin{pmatrix} \varepsilon_1 \\ \varepsilon_2 \\ \vdots \\ \varepsilon_G \end{pmatrix}
$$
where $\iota_g$ is a column vector of $n_g$ ones and $G$ is the number of groups. Notice that $E(\varepsilon\varepsilon') = \Omega$ is block diagonal with blocks $\Omega_1, \ldots, \Omega_G$, where
$$
\Omega_g = \sigma_\varepsilon^2\left[(1-\rho)I + \rho\,\iota_g\iota_g'\right], \qquad \rho = \frac{\sigma_v^2}{\sigma_v^2 + \sigma_\eta^2}.
$$
Now
$$
X'X = \sum_g X_g'X_g = \sum_g n_g x_g x_g', \qquad
X'\Omega X = \sum_g x_g\,\iota_g'\Omega_g\iota_g\,x_g'.
$$
But
$$
\iota_g'\Omega_g\iota_g = \sigma_\varepsilon^2\left[n_g(1-\rho) + n_g^2\rho\right] = \sigma_\varepsilon^2 n_g\left[1 + (n_g - 1)\rho\right].
$$
Denote $\tau_g = 1 + (n_g - 1)\rho$, so we get
$$
x_g\,\iota_g'\Omega_g\iota_g\,x_g' = \sigma_\varepsilon^2 n_g\tau_g\, x_g x_g', \qquad
X'\Omega X = \sigma_\varepsilon^2\sum_g n_g\tau_g\, x_g x_g'.
$$
With this at hand, we can compute the covariance matrix of the OLS estimator, which is
$$
\mathrm{var}(\hat\beta_{OLS}) = (X'X)^{-1}\,X'\Omega X\,(X'X)^{-1}
= \sigma_\varepsilon^2\left(\sum_g n_g x_g x_g'\right)^{-1}\left(\sum_g n_g\tau_g x_g x_g'\right)\left(\sum_g n_g x_g x_g'\right)^{-1}.
$$
We want to compare this with the standard OLS covariance estimator
$$
\mathrm{var}^*(\hat\beta_{OLS}) = \sigma_\varepsilon^2\left(\sum_g n_g x_g x_g'\right)^{-1}.
$$
If the group sizes are equal, $n_g = n$ and $\tau_g = \tau = 1 + (n - 1)\rho$, so that
$$
\mathrm{var}(\hat\beta_{OLS}) = \sigma_\varepsilon^2\,\tau\left(\sum_g n x_g x_g'\right)^{-1}\left(\sum_g n x_g x_g'\right)\left(\sum_g n x_g x_g'\right)^{-1}
= \tau\,\mathrm{var}^*(\hat\beta_{OLS}),
$$
which implies (7).
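The equal-group-size result can be checked numerically. The sketch below (our own illustration, with arbitrary parameter values) builds the equicorrelated $\Omega$ directly and verifies that the true OLS covariance is exactly $1 + (n-1)\rho$ times the conventional one:

```python
import numpy as np

# Verify var(b_OLS) = [1 + (n-1)*rho] * var*(b_OLS) for equal group sizes
G, n, rho, sigma2 = 6, 4, 0.3, 1.0
rng = np.random.default_rng(0)
xbar = rng.normal(size=G)                  # group-level regressor values
X = np.repeat(xbar, n).reshape(-1, 1)      # regressor fixed within groups

# Block-diagonal Omega with equicorrelated blocks
block = sigma2 * ((1 - rho) * np.eye(n) + rho * np.ones((n, n)))
Omega = np.kron(np.eye(G), block)

XtX_inv = np.linalg.inv(X.T @ X)
V_true = XtX_inv @ X.T @ Omega @ X @ XtX_inv    # correct OLS covariance
V_conv = sigma2 * XtX_inv                       # conventional estimator
moulton = 1 + (n - 1) * rho
assert np.allclose(V_true, moulton * V_conv)
```

With unequal group sizes the same code, modified to build groups of different lengths, reproduces the general sandwich formula with group-specific $\tau_g$.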

When is it really justifiable to ignore explanatory variable endogeneity in a regression model? Discussion Paper: 2015/05 When is it really justifiable to ignore explanatory variable endogeneity in a regression model? Jan F. Kiviet www.ase.uva.nl/uva-econometrics Amsterdam School of Economics Roetersstraat

More information

Difference-in-Differences Estimation

Difference-in-Differences Estimation Difference-in-Differences Estimation Jeff Wooldridge Michigan State University Programme Evaluation for Policy Analysis Institute for Fiscal Studies June 2012 1. The Basic Methodology 2. How Should We

More information

GMM and SMM. 1. Hansen, L Large Sample Properties of Generalized Method of Moments Estimators, Econometrica, 50, p

GMM and SMM. 1. Hansen, L Large Sample Properties of Generalized Method of Moments Estimators, Econometrica, 50, p GMM and SMM Some useful references: 1. Hansen, L. 1982. Large Sample Properties of Generalized Method of Moments Estimators, Econometrica, 50, p. 1029-54. 2. Lee, B.S. and B. Ingram. 1991 Simulation estimation

More information

Generalized Method of Moments: I. Chapter 9, R. Davidson and J.G. MacKinnon, Econometric Theory and Methods, 2004, Oxford.

Generalized Method of Moments: I. Chapter 9, R. Davidson and J.G. MacKinnon, Econometric Theory and Methods, 2004, Oxford. Generalized Method of Moments: I References Chapter 9, R. Davidson and J.G. MacKinnon, Econometric heory and Methods, 2004, Oxford. Chapter 5, B. E. Hansen, Econometrics, 2006. http://www.ssc.wisc.edu/~bhansen/notes/notes.htm

More information

Chapter 1. GMM: Basic Concepts

Chapter 1. GMM: Basic Concepts Chapter 1. GMM: Basic Concepts Contents 1 Motivating Examples 1 1.1 Instrumental variable estimator....................... 1 1.2 Estimating parameters in monetary policy rules.............. 2 1.3 Estimating

More information

Ordinary Least Squares Regression

Ordinary Least Squares Regression Ordinary Least Squares Regression Goals for this unit More on notation and terminology OLS scalar versus matrix derivation Some Preliminaries In this class we will be learning to analyze Cross Section

More information

Least Squares Estimation-Finite-Sample Properties

Least Squares Estimation-Finite-Sample Properties Least Squares Estimation-Finite-Sample Properties Ping Yu School of Economics and Finance The University of Hong Kong Ping Yu (HKU) Finite-Sample 1 / 29 Terminology and Assumptions 1 Terminology and Assumptions

More information

Notes on Panel Data and Fixed Effects models

Notes on Panel Data and Fixed Effects models Notes on Panel Data and Fixed Effects models Michele Pellizzari IGIER-Bocconi, IZA and frdb These notes are based on a combination of the treatment of panel data in three books: (i) Arellano M 2003 Panel

More information

Economics 620, Lecture 7: Still More, But Last, on the K-Varable Linear Model

Economics 620, Lecture 7: Still More, But Last, on the K-Varable Linear Model Economics 620, Lecture 7: Still More, But Last, on the K-Varable Linear Model Nicholas M. Kiefer Cornell University Professor N. M. Kiefer (Cornell University) Lecture 7: the K-Varable Linear Model IV

More information

Nonparametric Identi cation and Estimation of Truncated Regression Models with Heteroskedasticity

Nonparametric Identi cation and Estimation of Truncated Regression Models with Heteroskedasticity Nonparametric Identi cation and Estimation of Truncated Regression Models with Heteroskedasticity Songnian Chen a, Xun Lu a, Xianbo Zhou b and Yahong Zhou c a Department of Economics, Hong Kong University

More information

ECONOMETRICS II (ECO 2401) Victor Aguirregabiria. Spring 2018 TOPIC 4: INTRODUCTION TO THE EVALUATION OF TREATMENT EFFECTS

ECONOMETRICS II (ECO 2401) Victor Aguirregabiria. Spring 2018 TOPIC 4: INTRODUCTION TO THE EVALUATION OF TREATMENT EFFECTS ECONOMETRICS II (ECO 2401) Victor Aguirregabiria Spring 2018 TOPIC 4: INTRODUCTION TO THE EVALUATION OF TREATMENT EFFECTS 1. Introduction and Notation 2. Randomized treatment 3. Conditional independence

More information

Chapter 2. Dynamic panel data models

Chapter 2. Dynamic panel data models Chapter 2. Dynamic panel data models School of Economics and Management - University of Geneva Christophe Hurlin, Université of Orléans University of Orléans April 2018 C. Hurlin (University of Orléans)

More information

1 Regression with Time Series Variables

1 Regression with Time Series Variables 1 Regression with Time Series Variables With time series regression, Y might not only depend on X, but also lags of Y and lags of X Autoregressive Distributed lag (or ADL(p; q)) model has these features:

More information

TECHNICAL WORKING PAPER SERIES ROBUST INFERENCE WITH MULTI-WAY CLUSTERING. A. Colin Cameron Jonah B. Gelbach Douglas L. Miller

TECHNICAL WORKING PAPER SERIES ROBUST INFERENCE WITH MULTI-WAY CLUSTERING. A. Colin Cameron Jonah B. Gelbach Douglas L. Miller TECHNICAL WORKING PAPER SERIES ROBUST INFERENCE WITH MULTI-WAY CLUSTERING A. Colin Cameron Jonah B. Gelbach Douglas L. Miller Technical Working Paper 327 http://www.nber.org/papers/t0327 NATIONAL BUREAU

More information

Time Series Models and Inference. James L. Powell Department of Economics University of California, Berkeley

Time Series Models and Inference. James L. Powell Department of Economics University of California, Berkeley Time Series Models and Inference James L. Powell Department of Economics University of California, Berkeley Overview In contrast to the classical linear regression model, in which the components of the

More information

ECONOMET RICS P RELIM EXAM August 24, 2010 Department of Economics, Michigan State University

ECONOMET RICS P RELIM EXAM August 24, 2010 Department of Economics, Michigan State University ECONOMET RICS P RELIM EXAM August 24, 2010 Department of Economics, Michigan State University Instructions: Answer all four (4) questions. Be sure to show your work or provide su cient justi cation for

More information

Econometrics Midterm Examination Answers

Econometrics Midterm Examination Answers Econometrics Midterm Examination Answers March 4, 204. Question (35 points) Answer the following short questions. (i) De ne what is an unbiased estimator. Show that X is an unbiased estimator for E(X i

More information

Econometrics of Panel Data

Econometrics of Panel Data Econometrics of Panel Data Jakub Mućk Meeting # 1 Jakub Mućk Econometrics of Panel Data Meeting # 1 1 / 31 Outline 1 Course outline 2 Panel data Advantages of Panel Data Limitations of Panel Data 3 Pooled

More information

Chapter 6: Endogeneity and Instrumental Variables (IV) estimator

Chapter 6: Endogeneity and Instrumental Variables (IV) estimator Chapter 6: Endogeneity and Instrumental Variables (IV) estimator Advanced Econometrics - HEC Lausanne Christophe Hurlin University of Orléans December 15, 2013 Christophe Hurlin (University of Orléans)

More information

Day 3B Nonparametrics and Bootstrap

Day 3B Nonparametrics and Bootstrap Day 3B Nonparametrics and Bootstrap c A. Colin Cameron Univ. of Calif.- Davis Frontiers in Econometrics Bavarian Graduate Program in Economics. Based on A. Colin Cameron and Pravin K. Trivedi (2009,2010),

More information

Gravity Models, PPML Estimation and the Bias of the Robust Standard Errors

Gravity Models, PPML Estimation and the Bias of the Robust Standard Errors Gravity Models, PPML Estimation and the Bias of the Robust Standard Errors Michael Pfaffermayr August 23, 2018 Abstract In gravity models with exporter and importer dummies the robust standard errors of

More information

Applied Econometrics. Lecture 3: Introduction to Linear Panel Data Models

Applied Econometrics. Lecture 3: Introduction to Linear Panel Data Models Applied Econometrics Lecture 3: Introduction to Linear Panel Data Models Måns Söderbom 4 September 2009 Department of Economics, Universy of Gothenburg. Email: mans.soderbom@economics.gu.se. Web: www.economics.gu.se/soderbom,

More information

Estimation of Dynamic Nonlinear Random E ects Models with Unbalanced Panels.

Estimation of Dynamic Nonlinear Random E ects Models with Unbalanced Panels. Estimation of Dynamic Nonlinear Random E ects Models with Unbalanced Panels. Pedro Albarran y Raquel Carrasco z Jesus M. Carro x June 2014 Preliminary and Incomplete Abstract This paper presents and evaluates

More information

x i = 1 yi 2 = 55 with N = 30. Use the above sample information to answer all the following questions. Show explicitly all formulas and calculations.

x i = 1 yi 2 = 55 with N = 30. Use the above sample information to answer all the following questions. Show explicitly all formulas and calculations. Exercises for the course of Econometrics Introduction 1. () A researcher is using data for a sample of 30 observations to investigate the relationship between some dependent variable y i and independent

More information

An estimate of the long-run covariance matrix, Ω, is necessary to calculate asymptotic

An estimate of the long-run covariance matrix, Ω, is necessary to calculate asymptotic Chapter 6 ESTIMATION OF THE LONG-RUN COVARIANCE MATRIX An estimate of the long-run covariance matrix, Ω, is necessary to calculate asymptotic standard errors for the OLS and linear IV estimators presented

More information

ECNS 561 Multiple Regression Analysis

ECNS 561 Multiple Regression Analysis ECNS 561 Multiple Regression Analysis Model with Two Independent Variables Consider the following model Crime i = β 0 + β 1 Educ i + β 2 [what else would we like to control for?] + ε i Here, we are taking

More information

Bootstrap Testing in Econometrics

Bootstrap Testing in Econometrics Presented May 29, 1999 at the CEA Annual Meeting Bootstrap Testing in Econometrics James G MacKinnon Queen s University at Kingston Introduction: Economists routinely compute test statistics of which the

More information

Strati cation in Multivariate Modeling

Strati cation in Multivariate Modeling Strati cation in Multivariate Modeling Tihomir Asparouhov Muthen & Muthen Mplus Web Notes: No. 9 Version 2, December 16, 2004 1 The author is thankful to Bengt Muthen for his guidance, to Linda Muthen

More information

Econometrics - 30C00200

Econometrics - 30C00200 Econometrics - 30C00200 Lecture 11: Heteroskedasticity Antti Saastamoinen VATT Institute for Economic Research Fall 2015 30C00200 Lecture 11: Heteroskedasticity 12.10.2015 Aalto University School of Business

More information