A COMPARISON OF HETEROSCEDASTICITY ROBUST STANDARD ERRORS AND NONPARAMETRIC GENERALIZED LEAST SQUARES


MICHAEL O'HARA AND CHRISTOPHER F. PARMETER

Abstract. This paper presents a Monte Carlo comparison of several versions of heteroscedasticity robust standard errors (HRSEs) with a nonparametric feasible generalized least squares procedure (NPGLS). Results suggest that the NPGLS procedure provides an improvement in efficiency ranging from 3% to 12% or more at reasonable sample sizes for simple functional forms of heteroscedasticity. This results in tighter confidence intervals and more precise estimation and inference. Thus, the NPGLS estimator provides nearly identically sized hypothesis tests with a significant gain in power.

JEL Classification: C13 (Estimation), C14 (Semiparametric and nonparametric methods).
Key words and phrases: Bandwidth Selection, Monte Carlo Experiment, Simulations.
Date: February 22, 2011.

1. Introduction

Econometric estimation focuses on the construction of a consistent and efficient estimator. In the presence of heteroscedasticity, ordinary least squares (OLS) estimation of a correctly specified parametric model is consistent but inefficient. It is straightforward to show that a generalized least squares (GLS) procedure will produce an efficient estimator if the functional form of the heteroscedasticity is known up to a finite dimensional parameter. Under misspecification of this conditional variance function, feasible GLS (FGLS) still provides a consistent estimator of the unknown regression parameters; however, the misspecified scedastic function can produce an estimator of questionable efficiency. The resulting estimator can have a variance-covariance matrix that is larger than that of the initial OLS estimator, which ignored the presence of heteroscedasticity from the outset.

Two competing approaches exist for this vexing problem. The first has received little attention in applied econometric work, while the other has become the workhorse of empirical econometric research. Nonparametric generalized least squares (NPGLS) (Robinson 1987) follows along the same lines as a traditional FGLS procedure, but instead of positing a functional form for the conditional variance, it constructs a nonparametric estimate of it prior to reweighting in the second application of least squares. This method, while attractive, has paled in popularity and empirical favor next to the heteroscedasticity robust standard error (HRSE) construction following the seminal work of White (1980). White's approach is to use the residuals stemming from OLS to construct a variance-covariance matrix for the OLS estimator that is robust to all forms of heteroscedasticity. That is, the robust standard errors are based on the OLS variance-covariance matrix as opposed to that stemming from GLS. Thus, even though the standard errors (and corresponding test statistics) are robust in the presence of heteroscedasticity of unknown form, they are produced from the inefficient OLS estimator as opposed to the fully efficient GLS estimator.

The widespread use of HRSEs as opposed to NPGLS is summed up eloquently by Angrist and Pischke (2010, pg. 12):

    Robust standard errors, automated clustering, and larger samples have also taken the steam out of issues like heteroskedasticity and serial correlation. A legacy of White's (1980) paper on robust standard errors, one of the most highly cited from the period, is the near death of generalized least squares in cross-sectional applied work. In the interests of replicability, and to reduce the scope for errors, modern applied researchers often prefer simpler estimators though they might be giving up asymptotic efficiency.

Given the now widespread availability of canned nonparametric software (Hayfield and Racine 2008), as well as the steam that these methods are gaining in mainstream applied econometric work (Henderson 2009; Li, Racine and Wooldridge 2009), it seems an interesting exercise to quantify exactly the loss in asymptotic efficiency relative to the computational gains in a comparison of HRSEs versus traditional standard errors stemming from NPGLS. This paper marks one of the first attempts to rigorously compare these two competing methods, to determine their relative performance and whether NPGLS indeed has a place at the applied dinner table.

We perform Monte Carlo simulations of several forms of heteroscedasticity and compare the performance of several versions of HRSEs to the NPGLS procedure. Our results show that NPGLS provides confidence intervals that contain the true parameter value at essentially the same rate as HRSEs but are narrower by a factor ranging from 5% to 15% for simple functional forms of heteroscedasticity, and by significantly more for some more complex forms. In the context of inference based on the estimated standard errors, this provides a significant gain in power with no significant loss in size at reasonable sample sizes.

The organization of the paper is as follows. Section 2 provides a description of the econometric estimators we deploy in our simulation study. Section 3 details our simulation setup while Section 4 presents our findings. Concluding comments appear in Section 5.

2. The Estimators

Our starting point is the k-variate linear model

(1)    Y_i = X_i\beta + \varepsilon_i,    i = 1, \dots, n,

where i indexes observations and β is an unknown R^k-dimensional parameter vector. We assume that E[ε_i] = 0 for all i and V(ε_i) = σ²(X_i), which is allowed to differ across observations. This gives E[εε'] = Ω, a diagonal matrix with j-th diagonal element equal to σ²(X_j). The best linear unbiased estimator of β is the GLS estimator

(2)    \hat{\beta}_{GLS} = (X'\Omega^{-1}X)^{-1}X'\Omega^{-1}Y,

where X is the n × k matrix with i-th row corresponding to X_i, and Y is the n × 1 vector with i-th element Y_i.
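To fix ideas, here is a small numpy sketch that simulates a heteroscedastic version of (1) and contrasts OLS with the oracle GLS estimator (2). The covariate design, the linear scedastic function, and all names below are illustrative assumptions on our part, not the paper's configuration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Simulate the linear model (1) with a known (oracle) scedastic function.
n = 500
x = rng.uniform(size=n)                       # single covariate for clarity
X = np.column_stack([np.ones(n), x])          # n x k design matrix
beta = np.array([1.0, 2.0])
sigma2 = 1.0 + 2.0 * x                        # V(eps_i) = sigma^2(X_i)
y = X @ beta + rng.normal(scale=np.sqrt(sigma2))

# OLS: consistent but inefficient under heteroscedasticity.
beta_ols = np.linalg.solve(X.T @ X, X.T @ y)

# Oracle GLS (2): downweight high-variance observations by 1/sigma^2(X_i).
# Infeasible in practice, since sigma^2(.) is unknown.
w = 1.0 / sigma2
beta_gls = np.linalg.solve(X.T @ (w[:, None] * X), X.T @ (w * y))

print("OLS:", beta_ols, "GLS:", beta_gls)
```

The oracle weights are exactly what is unavailable in applications, which motivates the feasible strategies discussed next.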

As it stands, the GLS estimator is infeasible because of the presence of the unknown σ²(X_j). Several strategies are available to construct a feasible estimator. The most common is to assume a parametric form for σ²(X_j), use the squared residuals (ε̂_i = Y_i − X_i β̂) as observations on σ²_i, and estimate the parameters of the scedastic function. These fitted conditional variances are then used to construct Ω̂, and the FGLS estimator is constructed as in (2) but with Ω replaced by Ω̂. That is,

(3)    \hat{\beta}_{FGLS} = (X'\hat{\Omega}^{-1}X)^{-1}X'\hat{\Omega}^{-1}Y.

Under correct specification of σ²(x) this produces an estimator that is asymptotically equivalent to the oracle GLS estimator and is efficient in the class of linear unbiased estimators. However, misspecification can result in an estimator that is inefficient relative to the OLS estimator. It is well known that the OLS estimator of (1),

(4)    \hat{\beta}_{OLS} = (X'X)^{-1}X'Y,

has variance-covariance matrix

(5)    V(\hat{\beta}_{OLS}) = (X'X)^{-1}(X'\Omega X)(X'X)^{-1}.

The GLS estimator has variance-covariance matrix

(6)    V(\hat{\beta}_{GLS}) = (X'\Omega^{-1}X)^{-1},

which can be shown to differ from V(β̂_OLS) by a positive semidefinite matrix. White (1980) suggested replacing Ω in (5) with Ω̂, which is composed of the squared OLS residuals along the diagonal. This produces an estimator that is consistent for the oracle variance in (5) and is robust to unspecified heteroscedasticity. That is, one produces parameter estimates using the OLS estimator in (4) and then constructs standard errors from (5) using the squared residuals from the first stage regression.

Robinson (1987) proposed using OLS for a parametric analysis in the first stage when heteroscedasticity is suspected to exist, but focused attention on then constructing the efficient FGLS estimator with oracle variance (6). To avoid misspecification, the squared residuals arising from the preliminary OLS regression are nonparametrically regressed on the covariates to obtain a consistent estimator of σ²(X_i), which is then used to construct Ω̂. After this secondary estimation is finished, FGLS is performed and the variance-covariance matrix is constructed as in (6) with Ω replaced by Ω̂.

2.1. Nonparametric Estimation of the Scedastic Function. To estimate the scedastic function nonparametrically, Robinson (1987) proposed k-nearest neighbor estimation. However, in our simulations we deploy the local constant estimator of Nadaraya (1964) and Watson (1964). In our scedastic function setting, the regressand is composed of the residuals from the first stage application of OLS, ε̂ = Y − Xβ̂_OLS. Given that we generally do not know the true data generating process underlying the scedastic function, performing nonparametric estimation is warranted. The local constant estimator of σ²(x) is defined as

(7)    \hat{\sigma}^2(x) = \frac{\sum_{i=1}^{n} K_h(x_i, x)\,\hat{\varepsilon}_i^2}{\sum_{i=1}^{n} K_h(x_i, x)} = \sum_{i=1}^{n} A_i(x)\,\hat{\varepsilon}_i^2.
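Putting the pieces together, the following is a minimal sketch of the two-step procedure, assuming a single continuous covariate, a Gaussian kernel, and a user-supplied bandwidth h (data-driven bandwidth selection is taken up below). The function name npgls and its variable names are our own illustrative choices, not the paper's.

```python
import numpy as np

def gaussian_kernel(u):
    # Standard normal density as the kernel function k(u).
    return np.exp(-0.5 * u**2) / np.sqrt(2.0 * np.pi)

def npgls(y, X, x, h):
    # Step 1: first-stage OLS and its squared residuals.
    beta_ols = np.linalg.solve(X.T @ X, X.T @ y)
    e2 = (y - X @ beta_ols) ** 2

    # Step 2: local constant estimate of the scedastic function, as in (7):
    # a kernel-weighted average of the squared residuals around each x_i.
    K = gaussian_kernel((x[:, None] - x[None, :]) / h) / h
    sigma2_hat = (K @ e2) / K.sum(axis=1)

    # Step 3: FGLS (3) with the fitted variances; the variance-covariance
    # matrix follows (6) with Omega replaced by its estimate.
    w = 1.0 / sigma2_hat
    XtWX = X.T @ (w[:, None] * X)
    beta_npgls = np.linalg.solve(XtWX, X.T @ (w * y))
    return beta_npgls, np.linalg.inv(XtWX)
```

The reweighting step inherits whatever error the variance estimate carries, which is why the smoothing choice discussed next matters.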

In (7),

(8)    K_h(x_i, x) = \prod_{s=1}^{k} h_s^{-1}\, k\!\left(\frac{x_{is} - x_s}{h_s}\right)

is the standard product kernel. The kernel function k(u) is typically taken to be a symmetric probability density function. The bandwidths h_1, ..., h_k dictate the smoothness of the estimated scedastic function. When the level of smoothing is too small, the estimated curve is rough and fluctuates rapidly, whereas when the level of smoothing is too large, the estimated curve misses important local structure of the underlying scedastic function. Typically, data-driven methods are deployed to estimate the bandwidths in a manner which recognizes the inherent bias-variance trade-off at play here.

2.1.1. Data-driven Bandwidth Selection. Estimation of the bandwidths h_1, ..., h_k is commonly viewed as the key aspect of implementing nonparametric estimation. Although many selection methods exist, Hall, Li and Racine (2007) have shown that Least Squares Cross-Validation (LSCV) has the ability to smooth away irrelevant variables that may have been erroneously included in the unknown regression function. This is important in applied settings, as it is not always clear which variables from a set of controls in a parametric analysis are inducing heteroscedasticity within the error terms. Thus, an agnostic approach would be to include all variables from the outset. Given that nonparametric estimators suffer from the curse of dimensionality, including many variables in one's analysis of heteroscedasticity is discomforting. However, when engaging in LSCV to determine the bandwidths, in large samples it is expected that the variables erroneously included will be automatically removed and only the correct variables will remain.

The approach of LSCV is to select the bandwidths to minimize the squared difference between the estimated function and the outcome of interest. However, it is key to recognize that, for a given observation, the estimate at that point is influenced by its own outcome more than by any other observation's outcome. Thus, we use a leave-one-out estimator to remove this undue influence. Our LSCV criterion is

(9)    \min_{h_1, \dots, h_k} \; n^{-1} \sum_{i=1}^{n} \left(\hat{\varepsilon}_i^2 - \hat{\sigma}_{-i}^2(x_i)\right)^2,

where the leave-one-out estimator is defined as

(10)    \hat{\sigma}_{-i}^2(x_i) = \frac{\sum_{j \neq i} \hat{\varepsilon}_j^2 K_h(x_j, x_i)}{\sum_{j \neq i} K_h(x_j, x_i)}.
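A bare-bones illustration of the criterion, again assuming a single covariate and a Gaussian kernel: the function below evaluates (9)-(10) for a candidate bandwidth, and a crude grid search stands in for the numerical optimizer that production implementations (such as the np package of Hayfield and Racine 2008) would use. All names are ours.

```python
import numpy as np

def lscv(h, x, e2):
    # Leave-one-out criterion (9)-(10): predict each squared residual from
    # all of the *other* observations, then average the squared errors.
    u = (x[:, None] - x[None, :]) / h
    K = np.exp(-0.5 * u**2) / (np.sqrt(2.0 * np.pi) * h)
    np.fill_diagonal(K, 0.0)               # drop observation i's own weight
    sigma2_loo = (K @ e2) / K.sum(axis=1)  # leave-one-out estimator (10)
    return np.mean((e2 - sigma2_loo) ** 2)

# Illustrative bandwidth search over a crude grid, given squared OLS
# residuals e2 and covariate x:
# h_grid = np.linspace(0.05, 1.0, 40)
# h_star = h_grid[np.argmin([lscv(h, x, e2) for h in h_grid])]
```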

2.2. Variants of Heteroscedasticity Robust Standard Errors. Prior to conducting our simulations comparing the relative merits of standard errors constructed using NPGLS versus heteroscedasticity robust standard errors, it is important to recognize the rich literature on the construction of these widely used estimators. While White's (1980) paper contains the essence of constructing heteroscedasticity robust standard errors, numerous modifications to this simple setup have been proposed in the decades that have passed. To assist with the notation involved in the construction of heteroscedasticity robust variance-covariance matrices (and subsequently the standard errors), we rewrite White's (1980) initial proposal as

(11)    (X'X)^{-1}(X'D\hat{\Omega}X)(X'X)^{-1},

where

(12)    \hat{\Omega} = \mathrm{diag}\left(\hat{\varepsilon}_1^2, \dots, \hat{\varepsilon}_n^2\right)

and D = I_n is the identity matrix. This setup is commonly known as HC0 in the literature focusing on the construction of heteroscedasticity robust standard errors. A well known shortcoming of this estimator is that it tends to be substantially biased in small samples when the data contain leverage points; see Chesher and Jewitt (1987) for more on this. This bias works in the direction of being overly optimistic, so that heteroscedasticity robust t-tests are oversized and confidence intervals tend to be too narrow.

The work of MacKinnon and White (1985) was the first to propose alternative constructions of heteroscedasticity robust standard errors. MacKinnon and White (1985) noted that the HC0 setup did not account for the well known fact that ε̂_i ≠ ε_i for all i, and suggested a degrees of freedom correction to remedy this. Their simple degrees of freedom correction (which follows from Hinkley, 1977), known as HC1, uses

(13)    D = (n/(n - k))\, I_n.

Alternatively, MacKinnon and White (1985), following Horn, Horn, and Duncan (1975), use

(14)    D = \mathrm{diag}\left\{(1 - h_{11})^{-1}, \dots, (1 - h_{nn})^{-1}\right\},

where h_ii is the i-th diagonal element of the so-called hat matrix, X(X'X)^{-1}X'. This setup is referred to as HC2. The elegance of HC2 is that, if the error terms were homoscedastic, Var(ε_i) = σ² for all i, then the expectation of ε̂²_i/(1 − h_ii) would be σ². Lastly, MacKinnon and White (1985) show how HC0, HC1 and HC2 are all variants of a more formal jackknife estimator and propose HC3, for which

(15)    D = \mathrm{diag}\left\{(1 - h_{11})^{-2}, \dots, (1 - h_{nn})^{-2}\right\}.

This setup mitigates the influence that observations with large variances have on the overall estimates. The simulation results of MacKinnon and White (1985), Cribari-Neto and Zarkos (1999, 2001, 2004) and Long and Ervin (2000) suggest that HC3 is the most well behaved (in terms of size and power) of the four variants across a range of simulations. Theoretical work from Chesher (1989) and Chesher and Austin (1991) suggests that the perceived dominance of HC3 may not actually exist, finding that in certain scenarios HC2 will perform better.
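The HC0-HC3 family is mechanical enough to sketch in a few lines. The following numpy function is a minimal illustration of the sandwich form (11) under the D matrices in (12)-(15); the function name hc_se and its interface are our own illustrative choices, not a reference implementation.

```python
import numpy as np

def hc_se(y, X, variant="HC3"):
    # Sandwich estimator (11): (X'X)^{-1} (X' D Omega-hat X) (X'X)^{-1},
    # where D rescales the squared OLS residuals on the diagonal of (12).
    n, k = X.shape
    XtX_inv = np.linalg.inv(X.T @ X)
    e = y - X @ (XtX_inv @ (X.T @ y))                # OLS residuals
    h = np.einsum("ij,jk,ik->i", X, XtX_inv, X)      # leverages h_ii

    d = {"HC0": np.ones(n),                          # D = I_n
         "HC1": np.full(n, n / (n - k)),             # (13)
         "HC2": 1.0 / (1.0 - h),                     # (14)
         "HC3": 1.0 / (1.0 - h) ** 2}[variant]       # (15)

    meat = (X * (d * e**2)[:, None]).T @ X           # X' D Omega-hat X
    vcov = XtX_inv @ meat @ XtX_inv
    return np.sqrt(np.diag(vcov))                    # robust standard errors
```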

Recently, several additional variants for constructing heteroscedasticity robust standard errors have been suggested. Cribari-Neto (2004) proposed the HC4 estimator, using

(16)    D = \mathrm{diag}\left\{(1 - h_{11})^{-\delta_1}, \dots, (1 - h_{nn})^{-\delta_n}\right\},

where δ_i = min{4, nh_ii/k}, while Cribari-Neto et al. (2007) develop the HC5 estimator, where

(17)    D = \mathrm{diag}\left\{(1 - h_{11})^{-\delta_1/2}, \dots, (1 - h_{nn})^{-\delta_n/2}\right\}

and δ_i = min{nh_ii/k, max{4, nph_max/k}}, with h_max = max{h_11, ..., h_nn} and p ∈ [0, 1]. Cribari-Neto et al. (2007) suggest setting p = 0.7. Lastly, Cribari-Neto and Bernardino da Silva (2011) provide a modified HC4 estimator, defined as

(18)    D = \mathrm{diag}\left\{(1 - h_{11})^{-\delta_1}, \dots, (1 - h_{nn})^{-\delta_n}\right\},

where

(19)    \delta_i = \min\{\gamma_1, nh_{ii}/k\} + \min\{\gamma_2, nh_{ii}/k\}.

The authors suggest setting γ_1 = 1 and γ_2 = 1.5. This estimator is referred to as HC4m. Both HC4m and HC5, while providing cover against overly influential observations, rely on user-specified constants, making them unappealing in empirical work.

3. Simulation Setup

Simulations were performed using a univariate and a bivariate linear model. The DGP is specified as

(20)    y_i = \beta_0 + \beta_1 x_{i1} + \beta_2 x_{i2} + \varepsilon_i,

with β_2 restricted to zero in the univariate case. X_1 is specified as standard uniform and X_2 as standard normal. ε_i is specified as normal with E(ε_i) = 0.¹ The variance of ε_i is σ²_i = σ²(X_i), which is specified in several functional forms designed to mimic forms of heteroscedasticity that may be encountered in applied work. Each model is simulated for sample sizes of N = 50, 100, 250, 500 and 1000. All models are simulated for 1000 repetitions. A nominal 95% confidence interval is computed for β̂_j using the standard error estimates generated from each procedure. A size comparison is performed by computing the percentage of trials in which the true value of β_j falls outside of the computed confidence interval (so that the null hypothesis would be wrongly rejected). Therefore, the nominal size of the test is 5%.

¹ Long and Ervin (2000) use other distributions of the error in their simulation study. We replicate the χ² version here for comparison. However, this would result in a misspecified model under the standard assumptions, and so the issue would not really be one of heteroscedasticity. We assume a correctly specified model for the remainder of our study.
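As a rough illustration of this design, the sketch below runs the univariate version of (20) with a linear scedastic function and computes the empirical size of a nominal 5% test based on HC3 intervals. The function name, seed, and parameter values are illustrative assumptions, not the paper's exact configuration.

```python
import numpy as np

rng = np.random.default_rng(42)

def empirical_size(n, reps=1000, beta1=1.0):
    rejections = 0
    for _ in range(reps):
        # Univariate form of the DGP (20) with a linear scedastic function.
        x = rng.uniform(size=n)                      # X_1 standard uniform
        X = np.column_stack([np.ones(n), x])
        y = 0.5 + beta1 * x + rng.normal(scale=np.sqrt(1.0 + 2.0 * x))

        # OLS fit and HC3 interval for beta_1 (one of the compared variants).
        XtX_inv = np.linalg.inv(X.T @ X)
        b = XtX_inv @ (X.T @ y)
        e = y - X @ b
        h = np.einsum("ij,jk,ik->i", X, XtX_inv, X)
        omega = e**2 / (1.0 - h) ** 2                # HC3 weighting
        vcov = XtX_inv @ ((X * omega[:, None]).T @ X) @ XtX_inv

        # Count a rejection when the nominal 95% interval misses beta1.
        if abs(b[1] - beta1) > 1.96 * np.sqrt(vcov[1, 1]):
            rejections += 1
    return rejections / reps

# print(empirical_size(250))   # should land near the nominal 0.05
```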

Figure 1. Univariate linear model: empirical size versus sample size n for oracle GLS, HC3, and NPGLS.

Figure 2. Univariate model simulations: ratio of confidence interval widths, NPGLS to HC3, versus n for the linear, square root, and exponential scedastic functions.

4. Simulation Results

Results are presented here for a univariate model and a bivariate model in which the distributions of the explanatory variables differ under the specifications above. Several functional forms of heteroscedasticity are tested. For the univariate model, we have computed HC0 and HC3, as HC0 is still the most widely used version, while several papers have shown the superior performance of HC3, especially in smaller sample sizes. These are compared to the NPGLS procedure as well as oracle GLS.

Table 1 shows the results for four versions of the univariate model: the empirical size of a test of the true parameter value in the DGP at a nominal 5% significance level, along with the width of the confidence intervals as a ratio of those computed using HC0 and HC3; the HC3 ratios for the three heteroscedastic models are plotted in Fig. 2. The first model presented is one in which there is no heteroscedasticity, but the error terms are distributed as χ²₂ rather than normal. This model is for comparison with Long and Ervin (2000). The second model is one in which the variance increases linearly with x, a simple specification that is easily applicable to situations encountered in applied research. In this case, the NPGLS estimator is slightly oversized for small samples of 100 observations or fewer, while HC3 is correctly sized even in these small samples. However, the difference fades within a few hundred observations, so that NPGLS converges on the size of oracle GLS, as shown in Fig. 1. The confidence intervals computed by NPGLS are tighter by about 3% relative to HC3. For the other two models, the trend in size is similar, with NPGLS slightly oversized in small samples but converging quickly for sample sizes over 100, while the gains in efficiency are greater (see Fig. 2). For the model in which the variance is a square root function of x, there is a gain of a little over 5%, while for the exponential model the gains are substantially greater, at nearly 12%.

For simplicity, only HC3 is presented for the bivariate model, though results are similar and more pronounced for HC0. Size and confidence interval comparisons are computed for both parameter estimates, though heteroscedasticity may be a function of only one variable and unrelated to the other. Observations are taken out to 500 here, with the N = 1000 case to come. The ratios of confidence intervals relative to HC3 are presented in Fig. 4. In the first model, the variance is a root function of X_2 but uncorrelated with X_1. In this case, the NPGLS estimator starts out heavily oversized for the variable that is related to the variance (see Fig. 3). This is likely due to the heavy burden placed on the small dataset in having to compute nonparametric functions of both variables. This model shows a gain of about 10% in efficiency for the variable related to the variance, and about 6% for the other variable. In the second model, the variance is a complex function of the square roots of both variables. In this case, once the NPGLS has converged in size, there are efficiency gains of 3-4% compared to HC3.

5. Conclusion

The results of our Monte Carlo study suggest that there are significant gains in efficiency that are being neglected by applied researchers who opt for some version of HRSE to address heteroscedasticity. Modern computing technology and the wide availability of canned nonparametric software make nonparametric regression easily achievable, such that an NPGLS procedure can be used to gain significantly tighter confidence intervals and more powerful hypothesis tests without significant loss of size in reasonable samples. Future work will extend the comparison to cases in which the explanatory variables may include categorical variables, by utilizing kernels suitable for such data. We will also perform the NPGLS using a k-nearest neighbor estimator to relax continuity requirements on the scedastic function.

Figure 3. Bivariate model: empirical size (nominal = .05) versus n for β₁ and β₂ in models 1 and 2.

Figure 4. Bivariate model simulations: ratio of confidence interval widths, NPGLS to HC3, versus n for β₁ and β₂ in models 1 and 2.

Table 1. Univariate results

Scedasticity            n     size                                     CI to HC0        CI to HC3
function                      OLS    HC0    HC3    GLS    NPGLS    GLS    NPGLS    GLS    NPGLS

σ²_i = var(u_i),         50   0.050  0.047  0.051  0.069  0.060    1.160  1.017    1.096  0.960
u_i ~ χ²₂               100   0.053  0.055  0.048  0.066  0.064    1.123  1.005    1.092  0.977
                        250   0.049  0.045  0.046  0.062  0.052    1.110  0.998    1.098  0.987
                        500   0.055  0.052  0.057  0.060  0.053    1.106  0.998    1.100  0.993
                       1000   0.048  0.046  0.047  0.048  0.048    1.104  0.998    1.101  0.996

σ²_i = 1 + 2x_i          50   0.051  0.064  0.050  0.045  0.064    1.012  0.992    0.956  0.936
                        100   0.057  0.065  0.059  0.053  0.061    0.988  0.978    0.961  0.951
                        250   0.040  0.045  0.041  0.050  0.051    0.973  0.972    0.962  0.961
                        500   0.047  0.049  0.047  0.047  0.046    0.968  0.971    0.962  0.965
                       1000   0.047  0.045  0.045  0.045  0.040    0.966  0.970    0.964  0.967

σ²_i = √x_i              50   0.054  0.074  0.056  0.062  0.072    0.925  0.986    0.874  0.931
                        100   0.047  0.064  0.046  0.062  0.055    0.894  0.967    0.870  0.941
                        250   0.052  0.047  0.056  0.041  0.047    0.879  0.957    0.870  0.947
                        500   0.048  0.041  0.052  0.046  0.048    0.872  0.952    0.867  0.947
                       1000   0.048  0.047  0.055  0.050  0.050    0.869  0.944    0.866  0.942

σ²_i = exp(2x_i)         50   0.069  0.061  0.062  0.065  0.083    0.940  0.902    0.886  0.850
                        100   0.062  0.077  0.051  0.052  0.065    0.914  0.888    0.887  0.863
                        250   0.069  0.060  0.058  0.045  0.052    0.895  0.881    0.884  0.870
                        500   0.068  0.041  0.052  0.048  0.045    0.888  0.881    0.882  0.876
                       1000   0.069  0.053  0.050  0.053  0.056    0.885  0.882    0.883  0.880

Table 2. Bivariate results

Scedasticity                 n     size (5% nominal)                CI to HC3
function                           HC3           NPGLS          GLS           NPGLS
                                   β̂₁     β̂₂     β̂₁     β̂₂     β̂₁     β̂₂     β̂₁     β̂₂

σ²_i = 1 + √(x₂ + 5)          50   0.046  0.056  0.078  0.125   0.939  0.950  0.889  0.827
                             100   0.057  0.049  0.086  0.119   0.939  0.954  0.896  0.833
                             250   0.059  0.051  0.072  0.087   0.946  0.954  0.926  0.872
                             500   0.057  0.046  0.060  0.073   0.948  0.957  0.939  0.895

σ²_i = 1 + √(x₁)√(x₂ + 5)     50   0.051  0.050  0.096  0.128   1.053  1.018  0.883  0.846
                             100   0.038  0.055  0.067  0.098   1.065  1.023  0.931  0.894
                             250   0.053  0.054  0.067  0.065   1.078  1.030  0.962  0.932
                             500   0.059  0.054  0.066  0.054   1.081  1.030  0.974  0.955