Heteroskedasticity/Autocorrelation Consistent Standard Errors and the Reliability of Inference


Aris Spanos, Department of Economics, Virginia Tech, USA
James J. Reade, Department of Economics, University of Reading, UK
April 2015 [First draft]

Abstract: The primary aim of the paper is to investigate the error-reliability of F tests that use Heteroskedasticity-Consistent Standard Errors (HCSE) and Heteroskedasticity and Autocorrelation-Consistent Standard Errors (HACSE) using Monte Carlo simulations. For the design of the appropriate simulation experiments a broader perspective on departures from the homoskedasticity and autocorrelation assumptions is proposed, to avoid an internally inconsistent set of probabilistic assumptions. Viewing regression models as based on the first two conditional moments of the same conditional distribution brings out the role of the other probabilistic assumptions, such as Normality and Linearity, and provides a more coherent framework. The simulation results, under the best case scenario for these tests, show that all the HCSE/HACSE-based tests exhibit major size and power distortions. The results call seriously into question their use in practice as ways to robustify the OLS-based F-test.

1 Introduction

The basic objective of the paper is to revisit certain aspects of the traditional strategies for dealing with departures from the homoskedasticity and autocorrelation assumptions in the context of the Linear Regression (LR) model. In particular, the paper aims to appraise the error-reliability of Heteroskedasticity-Consistent Standard Errors (HCSE) (White, 1980) and their extension to Heteroskedasticity and Autocorrelation Consistent Standard Errors (HACSE); see Newey and West (1987), Andrews (1991), Hansen (1992), Robinson (1998), Kiefer et al (2005). The conventional wisdom relating to the use of HCSE has been articulated aptly by Hansen (1999) in the form of the following recommendation to practitioners: ... omit the tests of normality and conditional heteroskedasticity, and replace all conventional standard errors and covariance matrices with heteroskedasticity-robust versions. (p. 195)

The heteroskedasticity-robust versions of the conventional standard errors and covariance matrices refer to HCSE/HACSE as they pertain to testing hypotheses concerning the unknown regression coefficients. The question raised by the above recommendation is how one should evaluate the robustness of these procedures. It is argued that the proper way to do that is in terms of the error-reliability of the inference procedures they give rise to. In the case of estimation one needs to consider the possibility that the robustness strategy gives rise to non-optimal (e.g. inconsistent) estimators. In the case of testing, robustness should be evaluated in terms of the discrepancy between the relevant actual and nominal (assumed) error probabilities. As argued by Phillips (2005a): Although the generality that HAC estimation lends to inference is appealing, our enthusiasm for such procedures needs to be tempered by knowledge that finite sample performance can be very unsatisfactory.
Distortions in test size and low power in testing are both very real problems that need to be acknowledged in empirical work and on which further theoretical work is needed. (p. 12)

The key problem is that any form of statistical misspecification, including the presence of heteroskedasticity/autocorrelation, is likely to induce a discrepancy between the nominal and actual error probabilities of any test procedure, and when this discrepancy is sizeable, inferences are likely to give rise to erroneous results. The surest way to an unreliable inference is to use a .05 significance level test when the actual type I error is closer to .90. Such discrepancies can easily arise in practice even with what are often considered minor departures from certain probabilistic assumptions; see Spanos and McGuirk (2001). Hence the need to evaluate such discrepancies associated with the robustified procedures using HCSE/HACSE, to ensure that they are reliable enough for inference purposes.

The primary aim of the discussion that follows is twofold. First, to place the traditional recommendation of ignoring non-normality and using HCSE/HACSE for inference in conjunction with the OLS estimators in a broader perspective that sheds

light on what is being assumed for the traditional account to hold and what are the possible errors that might undermine this account. Second, to evaluate the extent to which this strategy addresses the unreliability-of-inference problem raised by the presence of heteroskedasticity/autocorrelation.

Section 2 revisits the robustification of inference procedures when dealing with departures from the Normality and homoskedasticity assumptions, and calls into question some of the presumptions underlying the use of the HCSE for inference purposes. It is argued that specifying the probabilistic structure of the LR model in terms of the error term provides a very narrow perspective on the various facets of modeling and inference, including specification (specifying the original model), misspecification testing (evaluating the validity of the model's assumptions) and respecification (respecifying the original model when found wanting). Section 3 presents a broader probabilistic perspective that views statistical models as parameterizations of the observable process {Z_t := (y_t, X_t), t ∈ N} underlying the data Z0 := {(y_t, x_t), t = 1, 2, ..., n}. When viewed in this broader context, the probabilistic assumptions specifying the LR model are shown to be interrelated, and thus relaxing one at a time can be misleading in practice. In addition, the traditional perspective on HCSE/HACSE is shown to involve several implicit assumptions that might be inappropriate in practice. For example, the use of the HACSE raises the possibility that the OLS estimator might be inconsistent when certain (implicit) restrictions imposed on the data are invalid. Section 4 uses Monte Carlo simulations with a view to investigating the effectiveness of the HCSE strategies vis-a-vis the reliability of inference pertaining to linear restrictions on the regression coefficients. The simulation results are extended in section 5 to include the error-reliability of testing procedures that use the HACSE.
The simulation results call into question the widespread use of the HCSE/HACSE-based tests because they are shown to have serious size and power distortions that would lead inferences astray in practice.

2 Revisiting non-normality/heteroskedasticity

2.1 The traditional perspective on Linear Regression

The traditional Linear Regression (LR) model (table 1) is specified in terms of the probabilistic assumptions (1)-(4) pertaining to the error term process {(u_t | X_t = x_t), t ∈ N}, where x_t denotes the observed values of the regressors X_t.

Table 1 - The Linear Regression (LR) Model
y_t = β0 + β1'x_t + u_t, t ∈ N = (1, 2, ...),
(1) Normality: (u_t | X_t = x_t) ~ N(., .),
(2) Zero mean: E(u_t | X_t = x_t) = 0,
(3) Homoskedasticity: E(u_t² | X_t = x_t) = σ²,
(4) No autocorrelation: E(u_t u_s | X_t = x_t) = 0, for t ≠ s.

For inference purposes the LR model is framed in terms of the data Z0 := {(y_t, x_t), t = 1, 2, ..., n}, stacked as Z0 := (y : X1), y: (n×1), X1: (n×k), and expressed in matrix notation:
y = Xβ + u,
where X := (1n : X1) is an (n×[k+1]) matrix, 1n is a column of 1s and β' := (β0 : β1'). To secure data information adequacy, the above assumptions are supplemented with an additional assumption:
(5) No collinearity: Rank(X) = k+1.
The cornerstone of the traditional LR model is the Gauss-Markov theorem for the optimality of the OLS estimator β̂ = (X'X)⁻¹X'y as Best Linear Unbiased Estimator (BLUE) of β under assumptions (2)-(5), i.e., β̂ has the smallest variance (is relatively efficient) within the class of linear and unbiased estimators.

The above recommendation, to ignore any departures from assumptions (1) and (3) and use HCSE, stems from four key presumptions. The first presumption, stemming from the Gauss-Markov theorem, is that the Normality assumption does not play a key role in securing the optimality of the OLS estimator for inference purposes. The second presumption is that all inferences based on the OLS estimators θ̂ := (β̂, s²), s² = û'û/(n-k-1), û = y − Xβ̂, of θ := (β, σ²) can be based on the asymptotic approximations:
β̂ ~a N(β, σ²(X'X)⁻¹), s² →P σ², (1)
where ~a denotes "asymptotically distributed as" and →P denotes convergence in probability. The third presumption is that when the homoskedasticity assumption in (3) is invalid, and instead:
(3)* E(uu') = Λ = diag(σ1², ..., σn²) ≠ σ²I, (2)
the OLS estimator β̂ remains unbiased and consistent, but is less efficient relative to the GLS estimator β̃ = (X'Λ⁻¹X)⁻¹X'Λ⁻¹y, since:
Cov(β̂) = (X'X)⁻¹X'ΛX(X'X)⁻¹, Cov(β̃) = (X'Λ⁻¹X)⁻¹.
In light of that, one could proceed to do inference using β̂ if only a consistent estimator of Cov(β̂) can be found. White (1980) argued that, although estimating Λ

suffers from the incidental parameter problem (the number of unknown parameters σ1², ..., σn² increases with n), estimating G = (1/n)X'ΛX does not, since it involves only (k+1)(k+2)/2 unknown terms. He proposed a consistent estimator of G:
Ĝ = (1/n) Σ_{t=1}^n û_t² (x_t x_t'),
which gives rise to Heteroskedasticity-Consistent Standard Errors (HCSE) for β̂ by replacing σ²(X'X)⁻¹ in (1) with:
Côv(β̂) = (X'X)⁻¹ (Σ_{t=1}^n û_t² x_t x_t') (X'X)⁻¹. (3)
The fourth presumption is that one can view the validation of the LR model one probabilistic assumption at a time. This presumption is more subtle, because it is often insufficiently realized that assumptions (1)-(4) are interrelated, and thus any form of model validation should take that into account. The misleading impression that departures from these assumptions can be viewed individually stems partly from the fact that the probabilistic structure is specified in terms of the error process {(u_t | X_t = x_t), t ∈ N = (1, 2, ...)}. However, the probabilistic structure that matters for modeling and inference is that of the observable stochastic process {Z_t, t ∈ N} underlying the data Z0 := {(y_t, x_t), t = 1, 2, ..., n}. This is because the latter defines the distribution of the sample as well as the likelihood function, which provide the cornerstones of both the frequentist and Bayesian approaches to inference.

2.2 How narrow is the traditional perspective?

All four presumptions behind the recommendation to ignore non-normality and use HCSE for testing hypotheses pertaining to β, such as:
H0: β = β0 vs. H1: β ≠ β0, (4)
can be called into question in terms of being vulnerable to several potential errors that can undermine the reliability of inference. To begin with, the Gauss-Markov theorem is of very little value for inference purposes for several reasons. First, the linearity of β̂ is a phony property, unbiasedness (E(β̂) = β) without consistency is dubious, and relative efficiency within an artificially restricted class of estimators is of very limited value.
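For concreteness, White's sandwich formula in (3) reduces, in the simple regression case with a single regressor, to a closed-form sum. The following sketch (illustrative Python, not the paper's code; the data-generating design and all numerical values are hypothetical) contrasts the conventional standard error of the slope with its HC0 (White) counterpart on a heteroskedastic sample:

```python
import random

def ols_slope_with_ses(x, y):
    """OLS for y = b0 + b1*x + u; returns the slope, the conventional SE
    (which assumes homoskedasticity) and White's HC0 SE (the
    simple-regression version of the sandwich formula in (3))."""
    n = len(x)
    xbar, ybar = sum(x) / n, sum(y) / n
    sxx = sum((xi - xbar) ** 2 for xi in x)
    b1 = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y)) / sxx
    b0 = ybar - b1 * xbar
    resid = [yi - b0 - b1 * xi for xi, yi in zip(x, y)]
    s2 = sum(u * u for u in resid) / (n - 2)
    se_conv = (s2 / sxx) ** 0.5
    # HC0: [ sum_t (x_t - xbar)^2 * uhat_t^2 ] / sxx^2
    se_hc0 = (sum((xi - xbar) ** 2 * u * u
                  for xi, u in zip(x, resid)) / sxx ** 2) ** 0.5
    return b1, se_conv, se_hc0

random.seed(0)
n = 200
x = [random.uniform(1.0, 5.0) for _ in range(n)]
# heteroskedastic design: the error standard deviation grows with x
y = [1.0 + 0.5 * xi + random.gauss(0.0, 0.5 * xi) for xi in x]
b1, se_conv, se_hc0 = ols_slope_with_ses(x, y)
```

Note that this sketch only reproduces the estimator itself; the paper's point is precisely that plugging se_hc0 into a test statistic need not deliver the nominal error probabilities.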
In addition, despite the fact that the theorem pertains to the finite sample properties of β̂, it cannot be used as a reliable basis for inference because it yields an unknown sampling distribution for β̂, i.e.
β̂ ~ D(β, σ²(X'X)⁻¹), D unknown. (5)
This provides a poor basis for any form of finite sample inference that involves error probabilities calling for the evaluation of tail areas. The problem was demonstrated by Bahadur and Savage (1956) in the context of the simple Normal model:
y_t ~ NIID(μ, σ²), θ := (μ, σ²) ∈ R×R+, t ∈ N.

They showed that when the Normality assumption is replaced with the existence of the first two moments, no unbiased or consistent t-type test for H0: μ = 0 vs. H1: μ ≠ 0 exists: It is shown that there is neither an effective test of the hypothesis that μ = 0, nor an effective confidence interval for μ, nor an effective point estimate of μ. These conclusions concerning μ flow from the fact that μ is sensitive to the tails of the population distribution; parallel conclusions hold for other sensitive parameters, and they can be established by the same methods as are here used for μ. (p. 1115)

That is, the existence of the first two moments (or even all moments) provides insufficient information to pin down the tail areas of the relevant distributions to evaluate the type I and II error probabilities with any accuracy. The Bahadur-Savage result explains why practitioners who otherwise praise the Gauss-Markov theorem ignore it when it comes to inference, and instead appeal to the asymptotic distributions of θ̂ := (β̂, s²). To justify the asymptotic approximations in (1), however, calls for supplementing the Gauss-Markov assumptions (2)-(5) with additional restrictions on the underlying vector stochastic process {Z_t := (y_t, X_t), t ∈ N}, such as:
lim_{n→∞} (X'X/n) = Σ22 > 0, lim_{n→∞} (X'y/n) = σ21 ≠ 0.
Lastly, despite the impression given by the Gauss-Markov theorem that the Normality assumption has a very limited role to play in inferences pertaining to θ := (β, σ²), the error-reliability of inference concerning H0 in (4) is actually adversely affected not only by departures from the Normality assumption in (1), but also by non-normality (especially any form of skewness) of the {X_t, t ∈ N} process; see Ali and Sharma (1996).

The third presumption is also questionable for two reasons. First, the claim about the GLS estimator being relatively more efficient than the OLS estimator depends on the covariance matrix Λ being known, which is never the case in practice.
Second, it relies on asymptotics without any reason to believe that the use of the HCSE will alleviate the unreliability-of-inference problem, evaluated in terms of the discrepancy between the actual and nominal error probabilities for a given n. Indeed, there are several potential errors that can give rise to sizeable discrepancies.

The fourth presumption is potentially the most questionable, because what is often insufficiently appreciated is that the error process {(u_t | X_t = x_t), t ∈ N} provides a much narrower perspective on the specification, misspecification and respecification facets of inference because, by definition, u_t = y_t − β0 − β1'x_t, and thus the error process retains the systematic component E(y_t | X_t = x_t) = β0 + β1'x_t. This was first demonstrated by Sargan (1964) by contrasting the modeling of temporal dependence using the error term with modeling it using the observable process {Z_t := (y_t, X_t), t ∈ N}. He showed that the LR model with an AR(1) error term:
y_t = β0 + β1'x_t + u_t, u_t = ρu_{t-1} + ε_t, i.e.
y_t = β0(1−ρ) + β1'x_t + ρy_{t-1} − ρβ1'x_{t-1} + ε_t,
is a special case of a Dynamic Linear Regression (DLR) model:
y_t = α0 + α1'x_t + α2 y_{t-1} + α3'x_{t-1} + ε_t, (6)

subject to the (non-linear) common factor restrictions:
α3 + α1α2 = 0. (7)
The DLR model in (6) arises naturally by modeling the Markov dependence directly in terms of the observable process {Z_t := (y_t, X_t), t ∈ N} via:
y_t = E(y_t | X_t = x_t, Z_{t-1}) + ε_t, t ∈ N.
In relation to (7), it is important to note that the traditional claim that the OLS estimator β̂ retains its unbiasedness and consistency despite departures from the no-autocorrelation assumption presumes the validity of the common factor restrictions; see Spanos (1986). The latter is extremely important because without it β̂ will be an inconsistent estimator.

A strong case can be made that the probabilistic assumptions (1)-(4) pertaining to the error term provide an inadequate perspective for all three facets of modeling: specification, misspecification testing and respecification. To illustrate that, consider the case where potential departures from assumption (4), i.e.
E(u_t u_s | X_t = x_t) ≠ 0 for t ≠ s, (8)
are being probed using the Durbin-Watson (D-W) misspecification test based on:
H0: ρ = 0 vs. H1: ρ ≠ 0,
in the context of the Autocorrelation-Corrected (A-C) LR model:
y_t = β0 + β1'x_t + u_t, u_t = ρu_{t-1} + ε_t, t ∈ N. (9)
In the case where the D-W test rejects H0, this is traditionally interpreted as providing evidence for the (A-C) LR model in (9). Hence, the traditional respecification takes the form of adopting (9) and declaring that the misspecification has been accounted for by replacing the original OLS estimator with a Generalized Least Squares (GLS) type estimator. This strategy, however, constitutes a classic example of the fallacy of rejection: misinterpreting "reject H0" [evidence against H0] as evidence for a particular H1. A rejection of H0 by a D-W test provides evidence against H0 and for the presence of generic temporal dependence as in (8), but does not provide any evidence for the particular form assumed by H1:
H1: E(u_t u_{t−τ} | X_t = x_t) = (σ_ε²/(1−ρ²)) ρ^τ, τ = 1, 2, ..., (10)
which stems from the AR(1) model for the error term; (10) represents one of an infinite number of dependence forms that (8) allows for.
To have evidence for (9) one needs to validate all the assumptions of the (A-C) LR model anew, including the validity of the common factor restrictions in (7); see Mayo and Spanos (2004).

The broader and more coherent vantage point for all three facets of modeling stems from viewing the Linear Regression model as specified in terms of the first two

moments of the same conditional distribution f(y_t | X_t; θ), i.e., the regression and skedastic functions:
E(y_t | X_t = x_t) = h(x_t), Var(y_t | X_t = x_t) = g(x_t), x_t ∈ R^k, (11)
where the functional forms h(.) and g(.) are determined by the joint distribution f(y_t, X_t; φ). From this perspective, departures from assumptions pertaining to one of the two functions are also likely to affect the other. As argued next, the above claims pertaining to the properties of β̂ under (3)* depend crucially on retaining some of the other Gauss-Markov assumptions, especially the linearity of the regression function stemming from assumption (2). The broader perspective indicates that relaxing one assumption at a time might not be a good general strategy, because the model assumptions are often interrelated.

Having said that, the focus of discussion in this paper will be the best case scenario for the traditional textbook perspective on non-normality and heteroskedasticity/autocorrelation. This is the case where all the other assumptions of the Linear Regression model, apart from Normality and homoskedasticity, are retained and the departures from the latter are narrowed down to accord with the traditional perspective. The question which naturally arises at this stage is: what are the probabilistic assumptions of the LR model whose validity is retained by this best case scenario? To answer that question unambiguously one needs to specify the LR model in terms of a complete set of probabilistic assumptions pertaining to the observable process {Z_t := (y_t, X_t), t ∈ N} underlying the data Z0. Choosing an arbitrary form of heteroskedasticity and simulating the LR model does not render the combination a coherent statistical model.
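The algebra linking the AR(1)-error model to the DLR in (6)-(7) can be checked mechanically: substituting the AR(1) error into the LR model yields DLR coefficients that satisfy the common factor restriction by construction, whereas a freely specified DLR generally violates it. A minimal sketch (illustrative Python; the parameter values are arbitrary):

```python
def implied_dlr(b0, b1, rho):
    """DLR coefficients implied by the LR model with an AR(1) error:
    y_t = b0 + b1*x_t + u_t, u_t = rho*u_{t-1} + e_t, which implies
    y_t = b0*(1-rho) + b1*x_t + rho*y_{t-1} - rho*b1*x_{t-1} + e_t."""
    a0 = b0 * (1.0 - rho)
    a1 = b1
    a2 = rho
    a3 = -rho * b1
    return a0, a1, a2, a3

a0, a1, a2, a3 = implied_dlr(b0=2.0, b1=0.8, rho=0.6)
# the common factor restriction (7) holds by construction:
restriction = a3 + a1 * a2

# a freely specified DLR (a3 = 0.3, a1 = 0.8, a2 = 0.6) need not satisfy (7):
free_restriction = 0.3 + 0.8 * 0.6
```

This is the sense in which the A-C LR model in (9) is a restricted special case of (6): estimating (9) when free_restriction-type departures hold imposes invalid restrictions on the data.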
3 A broader probabilistic perspective

Viewing the LR model as a parameterization of the observable vector stochastic process {Z_t := (y_t, X_t), t ∈ N} underlying the data Z0 has numerous advantages, including (i) providing a complete and internally consistent set of testable probabilistic assumptions, (ii) providing well-defined statistical parameterizations for the model parameters, (iii) bringing out the interrelationships among the probabilistic assumptions specifying the LR model, and (iv) enabling the modeler to distinguish between the statistical and the substantive premises of inference; see Spanos (2006). This is achieved by relating the LR model to the joint distribution of the observable process {Z_t := (y_t, X_t), t ∈ N} via the sequential conditioning:
f(Z_1, ..., Z_n; φ) = Π_{t=1}^n f(Z_t; φ) = Π_{t=1}^n f(y_t | X_t; ψ1) f(X_t; ψ2). (12)
The joint distribution f(Z_t; φ) takes the form:
(y_t, X_t)' ~ N( (μ1, μ2)', [σ11, σ21'; σ21, Σ22] ), (13)

which then gives rise to the conditional and marginal distributions f(y_t | X_t; ψ1) and f(X_t; ψ2):
(y_t | X_t = x_t) ~ N(β0 + β1'x_t, σ²), X_t ~ N(μ2, Σ22).
The LR model, as a purely probabilistic construct, is specified exclusively in terms of the conditional distribution f(y_t | X_t; ψ1) due to weak exogeneity; see Engle et al (1983). To be more specific, the LR model comprises the statistical Generating Mechanism (GM) in conjunction with the probabilistic assumptions [1]-[5] (table 2). This purely probabilistic construal of the Normal, Linear Regression model brings out the relevant parameterization of interest θ in terms of the primary parameters φ := (μ1, μ2, σ11, σ21, Σ22), as well as the relationships between the probabilistic assumptions of {Z_t := (y_t, X_t'), t ∈ N} and the model assumptions (table 3).

Table 2: The Normal, Linear Regression Model
Statistical GM: y_t = β0 + β1'x_t + u_t, t ∈ N,
[1] Normality: (y_t | X_t = x_t) ~ N(., .),
[2] Linearity: E(y_t | X_t = x_t) = β0 + β1'x_t, linear in x_t,
[3] Homoskedasticity: Var(y_t | X_t = x_t) = σ², free of x_t ∈ R^k,
[4] Independence: {(y_t | X_t = x_t), t ∈ N} is an independent process,
[5] t-invariance: θ := (β0, β1, σ²) do not change with t,
where β0 = μ1 − β1'μ2, β1 = Σ22⁻¹σ21, σ² = σ11 − σ21'Σ22⁻¹σ21.

Table 3: Reduction vs. model assumptions
Reduction: {Z_t, t ∈ N}    Model: {(y_t | X_t = x_t), t ∈ N}
Normality (N)    ->  [1]-[3]
Independence (I) ->  [4]
Identical Distribution (ID) -> [5]

These relationships are particularly important for guiding Mis-Specification (M-S) testing, probing the validity of the model assumptions [1]-[5], as well as respecification, choosing an alternative model when any of the assumptions [1]-[5] are found to be invalid for data Z0. Of particular interest in this paper is the connection between the [1] Normality and [3] Homoskedasticity assumptions, which stems from the joint Normality of Z_t. In relation to heteroskedasticity, it is very important to emphasize that the traditional assumption in (2) does not distinguish between heteroskedasticity and conditional variance heterogeneity:
Var(y_t | X_t = x_t) = σ²(t), for x_t ∈ R^k, t ∈ N,
where σ²(t) is an unknown function of the index t.
This distinction is crucial because the sources of the two departures are very different: heteroskedasticity stems from the non-normality of Z_t, whereas heterogeneity stems from the covariance heterogeneity of Z_t (table 3).
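The distinction can be made concrete by simulation: in one design the conditional variance is a function of x_t (heteroskedasticity), in the other it drifts with the index t (variance heterogeneity). A minimal sketch (illustrative Python; the functional forms are arbitrary choices, not the paper's):

```python
import random

random.seed(1)
n = 400
x = [random.gauss(0.0, 1.0) for _ in range(n)]

# (a) heteroskedasticity: Var(u_t | x_t) depends on x_t, here quadratically
u_het = [random.gauss(0.0, (1.0 + xi * xi) ** 0.5) for xi in x]

# (b) variance heterogeneity: Var(u_t) = sigma2(t) drifts with the index t,
#     a failure of t-invariance rather than a feature of f(y|x)
u_hot = [random.gauss(0.0, (1.0 + 0.01 * t) ** 0.5) for t in range(n)]

def var(v):
    m = sum(v) / len(v)
    return sum((vi - m) ** 2 for vi in v) / len(v)

# (a): the error variance is larger where |x_t| is large
v_far = var([u for xi, u in zip(x, u_het) if abs(xi) > 1.0])
v_near = var([u for xi, u in zip(x, u_het) if abs(xi) <= 1.0])

# (b): the error variance is larger in the second half of the sample
v_late = var(u_hot[n // 2:])
v_early = var(u_hot[:n // 2])
```

A White-type auxiliary regression on functions of x_t is designed to pick up (a); it has no power against (b), which calls for probing t-invariance instead.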

3.1 Viewing heteroskedasticity in a broader perspective

The recommendation to ignore any departures from Normality and homoskedasticity and instead use HCSE for inferences pertaining to θ := (β0, β1, σ²) takes for granted that all the other assumptions [2], [4]-[5] are valid for data Z0. This is unlikely to be the case in practice because of the interrelationships among the model assumptions. It is obvious that in cases where other model assumptions, in addition to [1] and [3], are invalid, the above recommendation will be a terrible strategy for practitioners. Hence, for the purposes of an objective evaluation of this recommendation it is important to focus on the best case scenario, which assumes that the model assumptions [2], [4]-[5] are valid for data Z0.

Due to the interrelationship between the Normality of Z_t and model assumptions [1]-[3], it is important to focus on joint distributions that retain the linearity assumption in [2], including the underlying parameterizations of θ := (β0, β1, σ²). Such a scenario can be accommodated into a coherent probabilistic reduction (PR) specification by broadening the joint Normal distribution in (13) to the Elliptically Symmetric (ES) family (Kelker, 1970), which can be specified in terms of the first two moments of Z_t via:
(y_t, X_t)' ~ ES( (μ1, μ2)', [σ11, σ21'; σ21, Σ22]; g(.) ). (14)
For different functional forms g(.) this family includes the Normal, the Student's t, the Pearson type II, the Logistic and other distributions; see Fang et al (1990). The appropriateness of the ES family of distributions for the best case scenario stems from the fact that:
(i) All the distributions are bell-shaped symmetric, resembling the Normal in terms of the shape of the density function, but its members are either platykurtic (α4 < 3) or leptokurtic (α4 > 3), with the Normal being the only mesokurtic (α4 = 3) distribution.
(ii) The regression function for all the members of the ES family of distributions is identical to that of the Normal (Nimmo-Smith, 1979):
E(y_t | X_t = x_t) = β0 + β1'x_t, x_t ∈ R^k.
(iii) The statistical parameterization of θ := (β0, β1, σ²) (table 2) is retained by all members of the ES family. This ensures that the OLS estimators of (β0, β1) will be unbiased and consistent.
(iv) Homoskedasticity characterizes the Normal distribution within the ES family, in the sense that all the other distributions have a heteroskedastic conditional variance; see Spanos (1995) for further discussion. For all the other members of the ES family the skedastic function takes the generic form:
Var(y_t | X_t = x_t) = g(x_t) = σ²·q(δ(x_t)), x_t ∈ R^k,
where δ(x_t) = (x_t − μ2)'Σ22⁻¹(x_t − μ2) and the form of q(.) depends on the marginal density of X_t; see Chu (1973). For instance, in the case where (14) is Student's t with ν degrees of freedom

(Spanos, 1994):
Var(y_t | X_t = x_t) = [νσ²/(ν + k − 2)][1 + (1/ν)(x_t − μ2)'Σ22⁻¹(x_t − μ2)]. (15)
Interestingly, the above form of heteroskedasticity in (15) is not unrelated to the misspecification test for homoskedasticity proposed by White (1980), which is based on testing the hypotheses:
H0: γ1 = 0 vs. H1: γ1 ≠ 0,
in the context of the auxiliary regression:
û_t² = γ0 + γ1'ψ_t + v_t,
where v_t is a white-noise error and ψ_t denotes the squares and cross-products of the regressors x_t, t = 1, 2, ..., n. It is often claimed that this White test is very general as a misspecification test for departures from the homoskedasticity assumption; see Greene (2011). The fact of the matter is that the squares and cross-products of the regressors in ψ_t stem primarily from retaining the linearity and the other model assumptions, and relaxing only the homoskedasticity assumption. The broader perspective stemming from viewing the LR model as based on the regression and skedastic functions in (11) brings out the fact that the above quadratic form of heteroskedasticity is likely to be the exception, not the rule, when the underlying joint distribution is non-normal; see Spanos (1999).

4 Monte Carlo Simulations 1: HCSE

To investigate the reliability of inference in terms of any discrepancies between actual and nominal error probabilities, we focus on linear restrictions on the coefficients β based on the hypotheses of interest:
H0: (Rβ − r) = 0 vs. H1: (Rβ − r) ≠ 0, rank(R) = m.
It is well-known that the optimal test under assumptions [1]-[5] (table 2) is the F-test defined by the test statistic:
τ(y) = (Rβ̂ − r)'[R(X'X)⁻¹R']⁻¹(Rβ̂ − r)/(m s²) ~ F(m, n−k−1) under H0,
where β̂ = (X'X)⁻¹X'y, s² = û'û/(n−k−1), û = y − Xβ̂, in conjunction with the rejection region:
C1 = {y: τ(y) ≥ c_α}.
The best case scenario in the context of which the error-reliability of the above F-test will be investigated is that the simulated data are generated using the Student's t Linear/Heteroskedastic Regression model in table 4.
The choice of this model is based on the fact that, in addition to satisfying conditions (i)-(iv) above, the Student's t distribution approximates the Normal distribution as the number of degrees of freedom (ν) increases.
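As a baseline for the size calculations that follow, one can verify by simulation that under the Normal LR model the asymptotic χ² version of the F-test does attain an actual size close to the nominal .05, so that the distortions reported below are attributable to the departures, not to the simulation design. A minimal sketch (illustrative Python for a single restriction, using the χ²(1) .95 quantile 3.841; this is not the paper's code and the design values are arbitrary):

```python
import random

CHI2_1_95 = 3.841  # .95 quantile of the chi-square(1) distribution

def size_of_wald_test(n, reps, seed=42):
    """Simulate y_t = 1 + 0*x_t + u_t with NIID errors (so the null
    H0: beta1 = 0 is true) and record how often the Wald statistic
    (b1/se)^2 exceeds the chi-square(1) critical value."""
    rng = random.Random(seed)
    rejections = 0
    for _ in range(reps):
        x = [rng.gauss(0.0, 1.0) for _ in range(n)]
        y = [1.0 + rng.gauss(0.0, 1.0) for _ in range(n)]
        xbar, ybar = sum(x) / n, sum(y) / n
        sxx = sum((xi - xbar) ** 2 for xi in x)
        b1 = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y)) / sxx
        b0 = ybar - b1 * xbar
        s2 = sum((yi - b0 - b1 * xi) ** 2
                 for xi, yi in zip(x, y)) / (n - 2)
        if b1 * b1 / (s2 / sxx) > CHI2_1_95:
            rejections += 1
    return rejections / reps

size_hat = size_of_wald_test(n=100, reps=2000)
```

With Normal errors the empirical rejection frequency stays close to .05; the experiments below replace the Normal with the Student's t model of table 4 and record how far the HCSE-based tests drift from this benchmark.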

Table 4: Student's t, Linear/Heteroskedastic Regression model
Statistical GM: y_t = β0 + β1'X_t + u_t, t ∈ N,
[1] Student's t: f(y_t | X_t; θ) is Student's t with ν + k degrees of freedom,
[2] Linearity: E(y_t | X_t) = β0 + β1'X_t, where X_t: (k×1),
[3] Heteroskedasticity: Var(y_t | X_t) = [νσ²/(ν + k − 2)][1 + (1/ν)(X_t − μ2)'Σ22⁻¹(X_t − μ2)],
[4] Independence: {Z_t, t ∈ N} is an independent process,
[5] t-invariance: θ := (β0, β1, σ², μ2, Σ22) are t-invariant,
where β0 = μ1 − β1'μ2, β1 = Σ22⁻¹σ21, σ² = σ11 − σ21'Σ22⁻¹σ21.

Maximization of the log-likelihood function for the above Student's t model in table 4 yields MLEs of β' := (β0, β1') that take an estimated Generalized Least Squares (GLS) form:
β̂ = (X'Ω̂⁻¹X)⁻¹X'Ω̂⁻¹y, Ω̂ = diag(ω̂1, ω̂2, ..., ω̂n),
ω̂_t = 1 + (1/ν)(X_t − μ̂2)'Σ̂22⁻¹(X_t − μ̂2), t = 1, ..., n,
σ̂² = [ν/((ν + k)n)]·û'Ω̂⁻¹û,
with μ̂2 and Σ̂22 denoting the MLEs of μ2 and Σ22, where all the above estimators are derived using numerical optimization; see Spanos (1994). The F-type test statistic based on the above MLEs takes the form:
τ*(y) = (Rβ̂ − r)'[R(X'Ω̂⁻¹X)⁻¹R']⁻¹(Rβ̂ − r)/(m σ̂²) ≈ F(m, n−k−1) under H0,
where ≈ denotes "distributed approximately as" under H0. The asymptotic form of this test statistic is:
F-MLE: m·τ*(y) ~a χ²(m) under H0. (16)
For simplicity we consider the case where R = I := diag(1, 1, ..., 1) for the simulations that follow.

4.1 Monte Carlo Experiment 1

This design is based on the same mean vector and covariance matrix for both the Normal and the Student's t distributions:
(y_t, X_t)' ~ ES( (μ1, μ2)', [σ11, σ21'; σ21, Σ22]; g(.) ),

giving rise to the following true parameter values: β0 = 1.9875, β1' = (...), σ² = 1. However, in order to evaluate the error-reliability of different inference procedures, we allow different values for the degrees of freedom ν, to investigate their reliability as ν increases; as ν increases the Student's t approximates the Normal distribution better. In principle one can generate samples of size n using the statistical Generating Mechanism (GM):
y_t^(i) = β0 + x_t^(i)'β1 + sqrt(g(x_t^(i))) ε_t^(i), t = 1, 2, ..., n,
g(x_t) = [νσ²/(ν + k − 2)][1 + (1/ν)(x_t − μ2)'Σ22⁻¹(x_t − μ2)],
to generate the artificial data realizations y^(1), y^(2), ..., y^(N), where y^(i) := (y_1, y_2, ..., y_n)', using pseudo-random numbers for ε^(i) ~ St(0, I; ν+k) and X^(i) ~ St(μ2, Σ22; ν). However, the simulations are more accurate when they are based on the multivariate Student's t distribution:
Z_t ~ St(μ, Σ; ν), t ∈ N, (17)
with (μ, Σ) reparameterized to estimate the unknown parameters θ := (β0, β1, σ²). Let U be an m-dimensional Normal random vector, U ~ NIID(0, I). Using the transformation:
Z = μ + sqrt(ν/V)·L'U, where L'L = Σ,
with V ~ χ²(ν) a chi-square distributed random variable with ν degrees of freedom, and U and V independent, one can generate data from a Student's t random vector as in (17); see Johnson (1987).

The aim of experiment 1 is to investigate the empirical error probabilities (size and power) of the different estimators of Cov(β̂) that give rise to the HCSE F-type test:
τ(y) = (Rβ̂ − r)'[R Côv(β̂) R']⁻¹(Rβ̂ − r) ~a χ²(m) under H0, (18)
where R = I := diag(1, 1, ..., 1) and Côv(β̂) is given in (3). For the probing of the relevant error probabilities we consider six scenarios: (a) two different values for the degrees of freedom parameter ν, ν = 4 and ν = 8, and (b) three sample sizes, n = 50, n = 100, n = 200. The idea is that increasing ν renders the tails of the underlying Student's t distribution less leptokurtic and closer to the Normal. For comparison purposes, the Standard Normal [N(0,1)] is plotted together with the Student's t densities [St(0,1; ν)] for ν = 4 and ν = 8 in the figure below.
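The above transformation can be sketched directly, here for a bivariate case with ν = 8 (illustrative Python; the mean, covariance and sample size are arbitrary choices, not the paper's design values). The draw Z = μ + sqrt(ν/V)·Cu uses a lower-triangular Cholesky factor C with CC' = Σ and V ~ χ²(ν) generated as a Gamma(ν/2, 2) variate:

```python
import random

random.seed(7)

def chol2(s11, s12, s22):
    """Lower-triangular Cholesky factor C of a 2x2 covariance, C C' = Sigma."""
    c11 = s11 ** 0.5
    c21 = s12 / c11
    return [[c11, 0.0], [c21, (s22 - c21 * c21) ** 0.5]]

def rmvt2(mu, C, nu):
    """One bivariate Student's t draw via Z = mu + sqrt(nu/V) * C u,
    with u ~ N(0, I2) and V ~ chi-square(nu) independent of u."""
    u = (random.gauss(0.0, 1.0), random.gauss(0.0, 1.0))
    v = random.gammavariate(nu / 2.0, 2.0)   # chi-square(nu)
    s = (nu / v) ** 0.5
    return (mu[0] + s * C[0][0] * u[0],
            mu[1] + s * (C[1][0] * u[0] + C[1][1] * u[1]))

nu, mu = 8, (0.0, 0.0)
C = chol2(1.0, 0.5, 1.0)
draws = [rmvt2(mu, C, nu) for _ in range(20000)]
mean1 = sum(z[0] for z in draws) / len(draws)
# the covariance of a multivariate t is (nu/(nu-2))*Sigma, so Var(Z1) = 8/6
var1 = sum(z[0] ** 2 for z in draws) / len(draws)
```

The inflation of the sample variance toward (ν/(ν−2))·Σ rather than Σ is exactly why the densities in the figure below are rescaled before being compared to the N(0,1).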
Note that the St(0,1; ν) densities are rescaled to ensure that their variance is also unity and not ν/(ν−2), to render them comparable to the N(0,1). The rescaled Student's t density is:
f(x; 0, 1, ν) = [Γ((ν+1)/2)/(Γ(ν/2)·sqrt(π(ν−2)))]·(1 + x²/(ν−2))^(−(ν+1)/2),

yielding E(X) = 0, Var(X) = 1.

[Figure: rescaled Student's t densities St(0,1;4) and St(0,1;8) plotted against the N(0,1) density.]

To get some idea about the differences in terms of the tail area:
P(|X| ≥ 2) = .0455 for N(0,1), .0805 for St(8), .1161 for St(4).
The Normal distribution is mesokurtic, since α4 = 3, but the Student's t is leptokurtic:
α4 = 3(ν−2)/(ν−4) = 3 + 6/(ν−4), ν > 4,
which brings out the role of the degrees of freedom parameter ν. It is important to note that the degrees of freedom of the conditional distribution associated with the Student's t Linear Regression model in table 4 increases with the number of regressors, i.e. ν + k.

The different sample sizes are chosen to provide a guide as to how large n needs to be for asymptotic results to be approximately acceptable. These choices for (n, ν) are based on shedding maximum light on the issues involved using the smallest number of different scenarios. Numerous other choices were tried before reducing the scenarios to the 6 that adequately represent the simulation results. The relevant error probabilities of the HCSE-based tests are compared to those of the F-test in (16) based on the Maximum Likelihood Estimators (MLEs) of the model in table 4.

Actual type I error probability (size) of HCSE-based tests. Figure 1 summarizes the size properties of the following forms of the F-test:
(i) OLS: β̂ in conjunction with s²(X'X)⁻¹,
(ii) HCSE-HW: β̂ in conjunction with the Hansen-White estimator of Cov(β̂),

(iii) HCSE-NW: β̂ in conjunction with the Newey-West estimator of Cov(β̂),
(iv) HCSE-A: β̂ in conjunction with the Andrews estimator of Cov(β̂), and
(v) F-MLE: the F-test based on the MLEs of both β and its covariance matrix for the Student's t Linear Model in table 4, using its asymptotic version in (16) to ensure comparability with the F-type tests (i)-(iv).

Fig. 1: Size of F-HCSE-based tests for different (n, ν)

The general conclusion is that all the heteroskedasticity-robustified F-type tests exhibit serious discrepancies between the actual and nominal type I error of .05. The actual type I error is considerably greater than the nominal for ν = 4 and all sample sizes from n = 50 to n = 200, ranging from .12 to .37. The only case where the size distortions are smaller is the case n = 200 and ν = 8, but even then the discrepancies will give rise to unreliable inferences. In contrast, the F-MLE test has excellent size properties for all values of ν and n. What is also noticeable is that the test in (i) is not uniformly worse than the other HCSE-based tests in terms of its size. Any attempt to rank the HCSE-based F-tests in terms of size distortion makes little sense because they all exhibit serious discrepancies from the nominal size.

The power properties of the HCSE-based tests. The general conclusion is that the power curves of the HCSE-based tests (i)-(iv) are totally dominated by that of the F-MLE test by serious margins for all scenarios with ν = 4, indicating that the degree of leptokurtosis has serious effects not only on the size but also on the power of HCSE-based tests; see figures 2-3, 6-7 and 10-11. The power domination of the F-MLE test becomes less pronounced, but remains significant, for scenarios with ν = 8 as n increases; see figures 4-5 and 8-9.

Figure 2 depicts the power functions of the HCSE-based tests (i)-(iv) for n = 50 and ν = 4 and different discrepancies from the null. The plot indicates serious power

distortions for all tests in (i)-(iv). These distortions are brought out more clearly in the size-corrected power given in figure 3. For instance, in the case of a 0.75 discrepancy from the null the HCSE-HW has power 0.47 while the F-MLE has power equal to 0.94! The best of the HCSE-based tests has power less than 0.74; that is a huge loss of power for n=50.

Fig. 2: Power of F-HCSE-based tests for n=50, ν=4
Fig. 3: Size-adjusted power of F-HCSE-based tests for n=50, ν=4
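The tail areas quoted earlier for the Normal and Student's t distributions, which drive the leptokurtosis effects seen in these power plots, can be reproduced with a short numerical check. The Simpson-rule integration of the t density below is purely an illustrative device, not part of the paper's simulation design:

```python
import math

# Numerical check of the tail areas P(|X| > 2) for N(0,1) and the
# Student's t with nu = 8 and nu = 4 degrees of freedom.

def t_density(x, nu):
    # Student's t density with nu degrees of freedom
    c = math.gamma((nu + 1) / 2) / (math.sqrt(nu * math.pi) * math.gamma(nu / 2))
    return c * (1 + x * x / nu) ** (-(nu + 1) / 2)

def two_sided_tail_t(nu, cut=2.0, steps=2000):
    # P(|X| > cut) = 2 * (0.5 - integral_0^cut f(x) dx), via Simpson's rule
    h = cut / steps
    s = t_density(0.0, nu) + t_density(cut, nu)
    for i in range(1, steps):
        s += (4 if i % 2 else 2) * t_density(i * h, nu)
    return 2 * (0.5 - s * h / 3)

def two_sided_tail_normal(cut=2.0):
    # P(|Z| > cut) for Z ~ N(0,1), via the error function
    return 2 * (1 - 0.5 * (1 + math.erf(cut / math.sqrt(2))))

# round(two_sided_tail_normal(), 4) -> 0.0455
# round(two_sided_tail_t(8), 4)     -> 0.0805
# round(two_sided_tail_t(4), 4)     -> 0.1161
```

The heavier tails of the t(4) distribution (more than double the Normal tail mass beyond ±2) are exactly the leptokurtosis that the simulations show the HCSE-based tests cannot cope with.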

Fig. 4: Power of F-HCSE-based tests for n=50, ν=8
Fig. 5: Size-adjusted power of F-HCSE-based tests for n=50, ν=8

What is worth noting in the case (n=50, ν=8) is that the robust versions of the covariance matrix are worse than the OLS with the traditional covariance matrices in terms of the error-reliability of the F-type test. This calls into question any claims that the use of the HCSE is always judicious and carries no penalty. For instance, for a 0.5 discrepancy from the null the best of these tests has power 0.63 but the F-MLE test has power

Fig. 6: Power of F-HCSE-based tests for n=100, ν=4
Fig. 7: Size-adjusted power of F-HCSE-based tests for n=100, ν=4

In the case (n=100, ν=4) there is a significant loss of power associated with the HCSE-based F-tests: the robust versions of the covariance matrix are worse than the OLS with the traditional covariance matrices in terms of the error-reliability of the F-type test. This calls into question any claims that the use of the HCSE is always judicious and carries no penalty.

Fig. 8: Power of F-HCSE-based tests for n=100, ν=8
Fig. 9: Size-adjusted power of F-HCSE-based tests for n=100, ν=8

The HCSE-based tests show smaller power distortions in the scenario (n=100, ν=8) than in the previous scenarios, but their size distortions render them unreliable enough to be of questionable value in practice.

Fig. 10: Power of F-HCSE-based tests for n=200, ν=4
Fig. 11: Size-adjusted power of F-HCSE-based tests for n=200, ν=4

Despite the increase in the sample size in the case (n=200, ν=4), there are significant power distortions associated with the HCSE-based F-tests, stemming primarily from the leptokurticity of the underlying distribution. This calls into question the pertinence of the recommendation to ignore departures from Normality.

Fig. 12: Power of F-HCSE-based tests for n=200, ν=8
Fig. 13: Size-adjusted power of F-HCSE-based tests for n=200, ν=8

The scenario (n=200, ν=8) represents the best case in terms of the distortions of power associated with the HCSE-based F-tests, but even in this case the size distortions are significant enough (at least double the nominal size) to render the reliability of these tests questionable.
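The "actual" type I errors reported throughout this section are empirical rejection frequencies under the null across Monte Carlo replications. A minimal sketch of how such a frequency is computed is given below; the test (a two-sided z-type test of a mean with Normal data), sample size and number of replications are illustrative choices, not the paper's design:

```python
import random

# Empirical (actual) size = fraction of Monte Carlo replications in which the
# test rejects a true null, to be compared with the nominal size 0.05.
random.seed(12345)
n, reps, crit = 50, 2000, 1.96   # illustrative choices

rejections = 0
for _ in range(reps):
    sample = [random.gauss(0.0, 1.0) for _ in range(n)]   # H0: mu = 0 is true
    mean = sum(sample) / n
    var = sum((x - mean) ** 2 for x in sample) / (n - 1)
    t_stat = mean / (var / n) ** 0.5
    if abs(t_stat) > crit:
        rejections += 1

actual_size = rejections / reps   # should be close to 0.05 here
```

In the paper's experiments the same calculation is carried out with leptokurtic (Student's t) data and the various HCSE-based F-statistics, which is where the empirical frequencies of 0.12 to 0.37 against a nominal 0.05 arise.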

5 Monte Carlo Experiment 2: HACSE

5.1 Traditional perspective

This is the case where the error-term assumptions (3)-(4), E(uu⊤|X) = σ²Iₙ, are relaxed together by replacing this assumption with:

(3)*-(4)*: E(uu⊤|X) = Vₙ,  (19)

where Vₙ := [σ_{ts}], t,s = 1,…,n, σ_{ts} = Cov(u_t, u_s|X). In this case the relevant Gₙ = X⊤VₙX is:

$G_n=\frac{1}{n}\sum_{t=1}^{n}\sigma_t^2\, x_t x_t^{\top}+\frac{1}{n}\sum_{t=1}^{n-1}\sum_{s=t+1}^{n}\sigma_{ts}\,(x_t x_s^{\top}+x_s x_t^{\top}).$  (20)

When only the assumption of homoskedasticity (3) is relaxed, only the first term is relevant, but when both are relaxed the second term represents the temporal dependence that stems from allowing for error autocorrelation. An obvious extension of the White (1980) estimator $\widehat{\Gamma}(0)=\frac{1}{n}\sum_{t=1}^{n}\widehat{u}_t^{2}\, x_t x_t^{\top}$ of Gₙ = X⊤VₙX, in the case where there are departures from both homoskedasticity and autocorrelation, is to use the additional terms in (20) that involve cross-products of the residuals with their lags:

$\widehat{\Gamma}(j)=\frac{1}{n}\sum_{t=j+1}^{n}\widehat{u}_t\widehat{u}_{t-j}\, x_t x_{t-j}^{\top},\quad j=1,2,\ldots$

Hansen (1992) extended the White estimator to derive an estimator of Gₙ = X⊤VₙX:

Hansen-White: $\widehat{G}_n=\widehat{\Gamma}(0)+\sum_{j=1}^{m}\big[\widehat{\Gamma}(j)+\widehat{\Gamma}^{\top}(j)\big],\ m>0.$

Newey and West (1987) used a weighted sum (based on the Bartlett kernel) of these matrices to estimate Gₙ:

Newey-West: $\widehat{G}_n=\widehat{\Gamma}(0)+\sum_{j=1}^{m}\big(1-\tfrac{j}{m+1}\big)\big[\widehat{\Gamma}(j)+\widehat{\Gamma}^{\top}(j)\big],\ m>0.$

Andrews (1991) proposed different weights (w_j, j = 1, 2, …):

Andrews: $\widehat{G}_n=\widehat{\Gamma}(0)+\sum_{j=1}^{m}w_j\big[\widehat{\Gamma}(j)+\widehat{\Gamma}^{\top}(j)\big].$

Using any one of these estimators of Gₙ yields a Heteroskedasticity-Autocorrelation Consistent estimator of the SE of β̂:

$\widehat{Cov}(\widehat{\beta})=\frac{1}{n}\Big[\big(\tfrac{1}{n}X^{\top}X\big)^{-1}\widehat{G}_n\big(\tfrac{1}{n}X^{\top}X\big)^{-1}\Big].$
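The building blocks above can be sketched directly in code: the residual-based matrices Γ̂(j) and their Bartlett-weighted sum (the Newey-West form). The residuals and regressors below are toy numbers purely for illustration, and the truncation lag m is a user-chosen bandwidth:

```python
# Gamma_hat(j) = (1/n) sum_{t=j+1}^{n} u_t u_{t-j} x_t x_{t-j}^T  (0-indexed below),
# combined with Bartlett weights (1 - j/(m+1)) as in the Newey-West estimator.

def outer(a, b):
    return [[ai * bj for bj in b] for ai in a]

def mat_add(A, B):
    return [[x + y for x, y in zip(ra, rb)] for ra, rb in zip(A, B)]

def transpose(A):
    return [list(r) for r in zip(*A)]

def gamma_hat(u, X, j):
    """Lag-j residual cross-product matrix, divided by n."""
    n, k = len(u), len(X[0])
    G = [[0.0] * k for _ in range(k)]
    for t in range(j, n):
        G = mat_add(G, [[u[t] * u[t - j] * v for v in row]
                        for row in outer(X[t], X[t - j])])
    return [[g / n for g in row] for row in G]

def newey_west(u, X, m):
    """Gamma(0) + sum_{j=1}^{m} (1 - j/(m+1)) [Gamma(j) + Gamma(j)^T]."""
    G = gamma_hat(u, X, 0)
    for j in range(1, m + 1):
        w = 1.0 - j / (m + 1)
        Gj = gamma_hat(u, X, j)
        G = mat_add(G, [[w * (a + b) for a, b in zip(ra, rb)]
                        for ra, rb in zip(Gj, transpose(Gj))])
    return G

# Toy residuals and regressors (intercept plus one regressor), for illustration
u = [0.5, -0.3, 0.8, -0.1, 0.4, -0.6]
X = [[1.0, x] for x in [1.2, 0.7, 1.9, 0.3, 1.1, 0.8]]
G_nw = newey_west(u, X, m=2)
```

The Hansen-White form is the special case with all weights w_j = 1, and the Andrews form replaces the Bartlett weights with kernel-specific ones; by construction each Γ̂(j) + Γ̂(j)⊤ term is symmetric, so the resulting estimate is symmetric.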

5.2 A broader probabilistic perspective for HACSE

How does one place the above departures from the error assumptions (3)-(4), pertaining to homoskedasticity and non-autocorrelation, into a broader but coherent perspective? The above estimators of Gₙ = X⊤VₙX make it clear that the temporal dependence in data Z₀ is modeled via the error term by indirectly imposing a certain probabilistic structure on the latter, such as m-correlation (MA(m)) or some form of asymptotic non-correlation; see White (1999). Is β̂ unbiased and consistent under (4)*?

Viewing the testing procedures using the HAC standard errors from the broader perspective of the LR model based on the regression and skedastic functions in (11), what is most surprising is that the original regression function is retained in its static form while the skedastic function is extended to include some form of error autocorrelation. How could that be justified? The traditional claim pertaining to the presence of error autocorrelation is that the OLS estimator β̂ = (X⊤X)⁻¹X⊤y retains its unbiasedness and consistency. This claim, however, depends crucially on the validity of the common factor restrictions in (7). When these restrictions are invalid for data Z₀, β̂ is both biased and inconsistent, which will give rise to majorly unreliable inferences; see Spanos (1986).

To bring out how restrictive the traditional strategy of adopting the HACSE robustification is, we will consider four different scenarios for the simulations that follow.

Scenario 1: no common factor restrictions are imposed and the modeler estimates the Dynamic Linear Regression (DLR) model.

Scenario 2: no common factor restrictions are imposed and the modeler estimates the Linear Regression (LR) model, ignoring the dynamics in the autoregressive function.

Scenario 3: common factor restrictions are imposed and the modeler estimates the Dynamic Linear Regression (DLR) model.
Scenario 4: common factor restrictions are imposed and the modeler estimates the Linear Regression (LR) model.

The broader probabilistic perspective suggests that both the regression and the skedastic functions are likely to be affected by the presence of any form of temporal dependence in data Z₀. Hence, the natural way to account for such dependence is to respecify the original model by expanding the relevant conditioning information set to include the past history of the observable process {Z_t := (y_t, X_t⊤), t ∈ N}:

$E(y_t|X_t,\,Z_{t-1})=h(X_t,\,Z_{t-1}),\quad Var(y_t|X_t,\,Z_{t-1})=g(x_t,\,z_{t-1}),\quad (x_t,\,z_{t-1})\in\mathbb{R}_{X}\times\mathbb{R}_{Z}.$  (21)

That is, any form of temporal dependence in Z₀ should be modeled directly in terms of the stochastic process {Z_t := (y_t, X_t⊤), t ∈ N}. This gives rise to an extension of the Student's t Linear/Heteroskedastic model that includes lags of this process. The Student's t Dynamic Linear Regression (DLR)

model (table 5) can be viewed as a particular parameterization of the vector (m×1) stochastic process {Z_t := (y_t, X_t⊤), t ∈ N}, assumed to be Student's t with ν degrees of freedom (df), Markov and stationary. The probabilistic reduction takes the form:

$f(Z_1,\ldots,Z_n;\phi)=\prod_{t=1}^{n}f(Z_t|Z_{t-1};\varphi)=\prod_{t=1}^{n}f(y_t|X_t,Z_{t-1};\psi_1)\,f(X_t|Z_{t-1};\psi_2),$

with f(y_t|X_t, Z_{t-1}; ψ₁) underlying the specification of the St-DLR model in table 5. Note that it is trivial to increase the number of lags to ℓ ≥ 1 by replacing Markovness with Markovness of order ℓ.

Table 5: Student's t, Dynamic Linear/Heteroskedastic model

Statistical GM: $y_t=\alpha_0+\beta_1^{\top}X_t+\beta_2^{\top}Z_{t-1}+u_t,\ t\in\mathbb{N}.$
[1] Student's t: f(Z_t|Z_{t-1}; ϕ) is Student's t with ν df,
[2] Linearity: $E(y_t|\sigma(X_t,Z_{t-1}))=\alpha_0+\beta_1^{\top}X_t+\beta_2^{\top}Z_{t-1},$
[3] Heteroskedasticity: $Var(y_t|\sigma(X_t,Z_{t-1}))=\frac{\sigma^{2}}{\nu+m-2}\,q_t^{2},$ where $q_t^{2}=\nu\big[1+(X_t-\mu_2)^{\top}Q_1(X_t-\mu_2)+(Z_{t-1}-\mu)^{\top}Q_2(Z_{t-1}-\mu)\big],$
[4] Markov: {Z_t, t ∈ N} is a Markov process,
[5] t-invariance: θ := (α₀, β₁, β₂, σ², μ, Q₁, Q₂) are t-invariant.

The estimation of the above model is based on the same log-likelihood function (see Appendix) as the model in table 2, but now the regressor vector is X_t* := (X_t⊤, Z_{t-1}⊤)⊤, whose number of elements is m.

5.3 Scenario 1 for HACSE

For this case, no common factor restrictions are imposed and the modeler accounts for the temporal structure in the regression function by estimating the Dynamic Linear Regression (DLR) model. The design is based on the same mean and covariance matrix (22) for both the Normal and the Student's t distributions, which gives rise to the true model parameters:

α₀ = 1.159, β₁⊤ = (·, ·), σ² = 3.30.

In the case of Normality, this gives rise to the following Dynamic Linear Regression model (DLR(1)):

$y_t=\alpha_0+\beta_1^{\top}X_t+\beta_2^{\top}Z_{t-1}+u_t,\quad u_t\sim \mathrm{NIID}(0,\sigma^{2}),\ t\in\mathbb{N}.$  (23)

The initial conditions are: σ₀² = 3.3, y₀ ~ N(3, 12), x₁₀ ~ N(18, 1), and x₂₀ ~ N(8, 1). In the case of a Student's t distribution with ν degrees of freedom, (23) remains the same but the conditional variance σ_t² := Var(y_t|σ(X_t, Z_{t-1})) takes the heteroskedastic form in [3] of table 5. The simulations aim to investigate the discrepancies between the actual (empirical) error probabilities and the nominal ones (size and power) associated with the different estimators of Cov(β̂) based on Gₙ = X⊤VₙX, under:
(a) varying values of the degrees of freedom parameter: ν=4, ν=8,
(b) varying the sample size: n=50, n=100, n=200.

Actual type I error probability (size) of HACSE-based tests. Figure 14 summarizes the size properties of the following forms of the F-test:
(i) OLS: β̂ in conjunction with s²(X⊤X)⁻¹,
(ii) HACSE-HW: β̂ in conjunction with the Hansen-White estimator of Gₙ = X⊤VₙX,
(iii) HACSE-NW: β̂ in conjunction with the Newey-West estimator of Gₙ = X⊤VₙX,
(iv) HACSE-A: β̂ in conjunction with the Andrews estimator of Gₙ = X⊤VₙX,
(v) F-MLE: the F-test based on the MLEs of both β and its covariance matrix for the Dynamic Student's t Linear Regression model in table 5.

It is important to emphasize at the outset that the scenario used for the simulations that follow includes estimating the DLR(1) in (23), because ignoring the dynamics and just estimating the static parameters

$\alpha_0=\mu_1-\beta_1^{\top}\mu_2,\quad \beta_1=\Sigma_{22}^{-1}\sigma_{21},\quad \sigma^{2}=\sigma_{11}-\sigma_{21}^{\top}\Sigma_{22}^{-1}\sigma_{21},$

will give rise to inconsistent estimators, and thus practically useless F-tests with or without the robustification; see Spanos and McGuirk (2001).

The general impression conveyed by fig. 14 is that all the HACSE-robustified F-type tests exhibit even more serious discrepancies between the actual and nominal type I error of 0.05 than those in figure 1. The actual type I error is considerably greater than the nominal for ν=4 and all sample sizes from n=50 to n=200, ranging from 0.15 to 0.61.
The only case where the size distortions are somewhat smaller is the case n=200 and ν=8, but even then the discrepancies will lead to seriously unreliable inferences. In contrast, the F-MLE has excellent size properties for all values of ν and n. What is also noticeable is that the test in (i) is not uniformly worse than the other HACSE-based tests in terms of its size. Indeed, for ν=8 and n=50, n=100, the OLS-based F-test has smaller size distortions than the robustified alternative F-tests. Any attempt to rank the HACSE-based F-tests in terms of size distortion makes little sense because they all exhibit serious discrepancies from the nominal size.
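The static parameters mentioned above are functions of the joint first two moments of Z_t = (y_t, X_t⊤). A minimal sketch of that mapping, with made-up moment values for k = 2 regressors, is:

```python
# Static LR parameters as functions of the joint moments:
#   alpha_0 = mu_1 - beta_1' mu_2,  beta_1 = Sigma_22^{-1} sigma_21,
#   sigma^2 = sigma_11 - sigma_21' Sigma_22^{-1} sigma_21.
# The moment values below are made up purely for illustration.

mu1, mu2 = 5.0, [2.0, 1.0]           # E(y_t), E(X_t)
sigma11 = 4.0                        # Var(y_t)
sigma21 = [1.2, 0.8]                 # Cov(X_t, y_t)
Sigma22 = [[2.0, 0.5], [0.5, 1.0]]   # Cov(X_t)

# Invert the 2x2 matrix Sigma22
det = Sigma22[0][0] * Sigma22[1][1] - Sigma22[0][1] * Sigma22[1][0]
inv = [[ Sigma22[1][1] / det, -Sigma22[0][1] / det],
       [-Sigma22[1][0] / det,  Sigma22[0][0] / det]]

beta1 = [sum(inv[i][j] * sigma21[j] for j in range(2)) for i in range(2)]
alpha0 = mu1 - sum(b * m for b, m in zip(beta1, mu2))
sigma2 = sigma11 - sum(s * b for s, b in zip(sigma21, beta1))
```

When the process {Z_t} is temporally dependent, estimating only these static parameters ignores the lagged terms β₂⊤Z_{t-1}, which is precisely the source of the inconsistency discussed above.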

Fig. 14: Size of F-HACSE-based tests for different (ν, n)

The power properties of the HACSE-based tests. Figure 15 shows the power functions of all the HACSE-based F-tests for n=50 and ν=4 and different discrepancies from the null. The plot indicates even more serious power distortions for all the HACSE-based F-tests than those in figure 2. These distortions are brought out more clearly in the size-corrected power given in figure 16. For instance, in the case of a 0.75 discrepancy from the null the HACSE-HW has power 0.44 while the F-MLE has power equal to 0.99! Worse, the OLS-based F-test does better in terms of power than any of the HACSE-based F-tests. Indeed, the HACSE-HW is dominated by all the other tests in terms of power.

The power curves for all the HACSE-based F-tests are totally dominated by those of the F-MLE test by serious margins for all scenarios with ν=4, indicating that the degree of leptokurtosis has serious effects not only on the size but also on the power of the HACSE-based tests; see figures 19-20 and 23-24. For scenarios with ν=8, as n increases the power domination of the F-MLE test becomes less pronounced, but remains significant; see figures 17-18, 21-22 and 25-26.

Fig. 15: Power of F-HACSE-based tests for n=50, ν=4
Fig. 16: Size-adjusted power of F-HACSE-based tests for n=50, ν=4

Fig. 17: Power of F-HACSE-based tests for n=50, ν=8
Fig. 18: Size-adjusted power of F-HACSE-based tests for n=50, ν=8

Despite the increase in the degrees of freedom parameter to ν=8, all the HACSE-based F-tests for n=50 suffer from serious power distortions, with the OLS-based F-test dominating them in terms of both size and power; see figures 17-18.

Fig. 19: Power of F-HACSE-based tests for n=100, ν=4
Fig. 20: Size-adjusted power of F-HACSE-based tests for n=100, ν=4

The serious distortions in both size and power observed above continue to exist in the scenario (n=100, ν=4), suggesting that the leptokurtosis has a serious effect on both; see figures 19-20. This particular departure from Normality can be ignored only at the expense of the reliability of inference.

Fig. 21: Power of F-HACSE-based tests for n=100, ν=8
Fig. 22: Size-adjusted power of F-HACSE-based tests for n=100, ν=8

When the degrees of freedom parameter increases to ν=8, the distortions in size and power are moderated, but they are still serious enough to call into question the reliability of any inference based on the HACSE-based F-tests; see figures 21-22.

Fig. 23: Power of F-HACSE-based tests for n=200, ν=4
Fig. 24: Size-adjusted power of F-HACSE-based tests for n=200, ν=4

The scenario (n=200, ν=4) brings out the adverse effects of leptokurticity on both the size and the power of the HACSE-based F-tests, despite the large sample size.

Fig. 25: Power of F-HACSE-based tests for n=200, ν=8
Fig. 26: Size-adjusted power of F-HACSE-based tests for n=200, ν=8

5.4 Scenario 2 for HACSE

No common factor restrictions are imposed and the modeler estimates the Linear Regression (LR) model, ignoring the dynamics in the autoregressive function, i.e. estimation and the F-test focus exclusively on the static LR model.

###################################

5.5 Scenario 3 for HACSE

For this case, common factor restrictions are imposed and the modeler takes into account the temporal structure by estimating the Dynamic Linear Regression (DLR) model. As first shown by Sargan (1964), modeling the Markov dependence of the process {Z_t := (y_t, X_t⊤), t ∈ N} using an AR(1) error term:

$y_t=\alpha_0+\beta_1^{\top}x_t+u_t,\quad u_t=\rho u_{t-1}+\varepsilon_t,$

constitutes a special case of the DLR model in (6) subject to the common factor restrictions α₃ + α₁α₂ = 0. It can be shown (McGuirk and Spanos, 2009) that these parameter restrictions imply a highly unappetizing temporal structure for {Z_t, t ∈ N}. The common factor restrictions transform the unrestricted Σ into the restricted Σ*:

$\Sigma^{*}=Cov\begin{pmatrix}Z_t\\ Z_{t-1}\end{pmatrix}=\begin{pmatrix}\sigma_{11}(0) & \sigma_{21}^{\top}(0) & \sigma_{11}(1) & \sigma_{21}^{\top}(1)\\ \sigma_{21}(0) & \Sigma_{22}(0) & \sigma_{21}(1) & \Sigma_{22}^{\top}(1)\\ \sigma_{11}(1) & \sigma_{21}^{\top}(1) & \sigma_{11}(0) & \sigma_{21}^{\top}(0)\\ \sigma_{21}(1) & \Sigma_{22}(1) & \sigma_{21}(0) & \Sigma_{22}(0)\end{pmatrix},$

where, under the common factor restrictions:

$\sigma_{11}(0)=\beta^{\top}\Sigma_{22}(0)\beta+\sigma^{2},\quad \sigma_{21}(0)=\Sigma_{22}(0)\beta,\quad \sigma_{11}(1)=\beta^{\top}\Sigma_{22}(1)\beta+\rho\sigma^{2},\quad \sigma_{21}(1)=\Sigma_{22}(1)\beta.$

The nature of the common factor restrictions can be best appreciated in the context of a VAR(1) model:

$Z_t=A^{\top}Z_{t-1}+E_t,\quad E_t\sim \mathrm{NIID}(0,\Omega),\ t\in\mathbb{N},$  (24)

entailed by the restricted Σ*, which takes the form:

$A^{\top}=\begin{pmatrix}\rho & \big((D-\rho I)\beta\big)^{\top}\\ 0 & D\end{pmatrix},\qquad \Omega=\begin{pmatrix}\sigma^{2}+\beta^{\top}\Lambda\beta & \beta^{\top}\Lambda\\ \Lambda\beta & \Lambda\end{pmatrix},$

$D=\Sigma_{22}(0)^{-1}\Sigma_{22}(1),\qquad \Lambda=\Sigma_{22}(0)-\Sigma_{22}(1)\Sigma_{22}(0)^{-1}\Sigma_{22}(1).$

The question that naturally arises at this stage is whether the error-reliability of the F-test will improve by imposing such a highly restrictive temporal structure on the process {Z_t, t ∈ N}. To ensure the validity of these restrictions, the variance-covariance matrix of scenario 1 needs to be modified slightly; the modified mean and covariance are given in (25)-(26).
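The algebraic content of the common factor restrictions can be verified numerically: the static LR model with an AR(1) error is identical to a DLR(1) whose lagged-regressor coefficients equal -ρβ₁. All numbers below (ρ, β₁, the x and ε series) are made up for illustration:

```python
# Under the common factor restrictions, the static model with AR(1) errors,
#   y_t = a0 + b' x_t + u_t,  u_t = rho*u_{t-1} + e_t,
# is algebraically identical to the restricted DLR(1)
#   y_t = a0*(1-rho) + rho*y_{t-1} + b' x_t - rho*b' x_{t-1} + e_t.

a0, b, rho = 1.0, [0.5, -0.2], 0.6
x = [[1.2, 0.3], [0.7, 1.1], [1.9, 0.4], [0.3, 0.9], [1.1, 0.2]]
e = [0.10, -0.05, 0.20, 0.00, -0.15]

def dot(v, w):
    return sum(vi * wi for vi, wi in zip(v, w))

# Route 1: generate y via the AR(1) error term (u_0 = 0 for simplicity)
u, y_ar = 0.0, []
for t in range(5):
    u = rho * u + e[t]
    y_ar.append(a0 + dot(b, x[t]) + u)

# Route 2: generate y via the restricted DLR(1), starting from y_ar[0]
y_dlr = [y_ar[0]]
for t in range(1, 5):
    y_dlr.append(a0 * (1 - rho) + rho * y_dlr[t - 1]
                 + dot(b, x[t]) - rho * dot(b, x[t - 1]) + e[t])
```

The two routes produce the same series, which is the common-factor equivalence; when the restriction β₂ = -ρβ₁ fails in the data, the AR(1)-error model (and hence the static LR model whose OLS estimator is claimed to remain unbiased) is misspecified.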

This gives rise to the true model parameters that do satisfy the common factor restrictions:

α₀ = 1.1448, β₁⊤ = (·, ·), σ² = 3.17.

In the case of Normality, this gives rise to the following Dynamic Linear Regression model (DLR(1)):

$y_t=\alpha_0+\beta_1^{\top}X_t+\beta_2^{\top}Z_{t-1}+u_t,\quad u_t\sim \mathrm{NIID}(0,\sigma^{2}),\ t\in\mathbb{N}.$  (27)

#########################

5.6 Scenario 4 for HACSE

Common factor restrictions are imposed and the modeler estimates the Linear Regression (LR) model, ignoring the dynamics in the autoregressive function, i.e. estimation and the F-test focus exclusively on the static LR model.

###################################

6 Summary and conclusions

The primary aim of the paper has been to investigate the error-reliability of HCSE/HACSE-based tests using Monte Carlo simulations. For the design of the appropriate simulation experiments it was important to view the departures from the homoskedasticity and autocorrelation assumptions from the broader perspective of regression models based on the first two conditional moments of the same distribution. It was argued that viewing such models from the error-term perspective provides a narrow and often misleading view of these departures and the ways they can be accounted for.

The simulation results call into question the conventional wisdom of viewing HCSE/HACSE as robustified versions of the OLS covariances. Robustness in this case can only be evaluated in terms of how closely the actual error probabilities approximate the nominal ones. In terms of the latter, it is shown that the various HCSE- and HACSE-based F-tests give rise to major size and power distortions. These distortions are particularly pernicious in the presence of leptokurticity. Although further Monte Carlo simulations will be needed to get a more general picture of the error-reliability of these tests, the results in this paper call into question their widespread use in practice because they do not, in general, give rise to reliable testing results.
Hence, the recommendation to ignore departures from Normality and account for departures from Homoskedasticity/Autocorrelation using HCSE/HACSE is highly misleading. In conclusion, it is important to emphasize that the above simulation results are based on the best case scenario for these HCSE- and HACSE-based F-tests, where the probabilistic assumptions of [2] linearity, [5] t-invariance and [1] bell-shape symmetry of the underlying conditional distribution are retained. When any of these assumptions are invalid for the particular data, the discrepancy between actual and nominal error probabilities is likely to be even larger.


Linear Regression with Time Series Data

Linear Regression with Time Series Data u n i v e r s i t y o f c o p e n h a g e n d e p a r t m e n t o f e c o n o m i c s Econometrics II Linear Regression with Time Series Data Morten Nyboe Tabor u n i v e r s i t y o f c o p e n h a g

More information

Economics 536 Lecture 7. Introduction to Specification Testing in Dynamic Econometric Models

Economics 536 Lecture 7. Introduction to Specification Testing in Dynamic Econometric Models University of Illinois Fall 2016 Department of Economics Roger Koenker Economics 536 Lecture 7 Introduction to Specification Testing in Dynamic Econometric Models In this lecture I want to briefly describe

More information

FinQuiz Notes

FinQuiz Notes Reading 10 Multiple Regression and Issues in Regression Analysis 2. MULTIPLE LINEAR REGRESSION Multiple linear regression is a method used to model the linear relationship between a dependent variable

More information

Advanced Econometrics

Advanced Econometrics Based on the textbook by Verbeek: A Guide to Modern Econometrics Robert M. Kunst robert.kunst@univie.ac.at University of Vienna and Institute for Advanced Studies Vienna May 16, 2013 Outline Univariate

More information

MULTIPLE REGRESSION AND ISSUES IN REGRESSION ANALYSIS

MULTIPLE REGRESSION AND ISSUES IN REGRESSION ANALYSIS MULTIPLE REGRESSION AND ISSUES IN REGRESSION ANALYSIS Page 1 MSR = Mean Regression Sum of Squares MSE = Mean Squared Error RSS = Regression Sum of Squares SSE = Sum of Squared Errors/Residuals α = Level

More information

Understanding Regressions with Observations Collected at High Frequency over Long Span

Understanding Regressions with Observations Collected at High Frequency over Long Span Understanding Regressions with Observations Collected at High Frequency over Long Span Yoosoon Chang Department of Economics, Indiana University Joon Y. Park Department of Economics, Indiana University

More information

Answers to Problem Set #4

Answers to Problem Set #4 Answers to Problem Set #4 Problems. Suppose that, from a sample of 63 observations, the least squares estimates and the corresponding estimated variance covariance matrix are given by: bβ bβ 2 bβ 3 = 2

More information

Regression Analysis. y t = β 1 x t1 + β 2 x t2 + β k x tk + ϵ t, t = 1,..., T,

Regression Analysis. y t = β 1 x t1 + β 2 x t2 + β k x tk + ϵ t, t = 1,..., T, Regression Analysis The multiple linear regression model with k explanatory variables assumes that the tth observation of the dependent or endogenous variable y t is described by the linear relationship

More information

Econ 423 Lecture Notes: Additional Topics in Time Series 1

Econ 423 Lecture Notes: Additional Topics in Time Series 1 Econ 423 Lecture Notes: Additional Topics in Time Series 1 John C. Chao April 25, 2017 1 These notes are based in large part on Chapter 16 of Stock and Watson (2011). They are for instructional purposes

More information

Interpreting Regression Results

Interpreting Regression Results Interpreting Regression Results Carlo Favero Favero () Interpreting Regression Results 1 / 42 Interpreting Regression Results Interpreting regression results is not a simple exercise. We propose to split

More information

Iris Wang.

Iris Wang. Chapter 10: Multicollinearity Iris Wang iris.wang@kau.se Econometric problems Multicollinearity What does it mean? A high degree of correlation amongst the explanatory variables What are its consequences?

More information

Week 11 Heteroskedasticity and Autocorrelation

Week 11 Heteroskedasticity and Autocorrelation Week 11 Heteroskedasticity and Autocorrelation İnsan TUNALI Econ 511 Econometrics I Koç University 27 November 2018 Lecture outline 1. OLS and assumptions on V(ε) 2. Violations of V(ε) σ 2 I: 1. Heteroskedasticity

More information

Dynamic Regression Models (Lect 15)

Dynamic Regression Models (Lect 15) Dynamic Regression Models (Lect 15) Ragnar Nymoen University of Oslo 21 March 2013 1 / 17 HGL: Ch 9; BN: Kap 10 The HGL Ch 9 is a long chapter, and the testing for autocorrelation part we have already

More information

Multiple Linear Regression

Multiple Linear Regression Multiple Linear Regression Asymptotics Asymptotics Multiple Linear Regression: Assumptions Assumption MLR. (Linearity in parameters) Assumption MLR. (Random Sampling from the population) We have a random

More information

A New Approach to Robust Inference in Cointegration

A New Approach to Robust Inference in Cointegration A New Approach to Robust Inference in Cointegration Sainan Jin Guanghua School of Management, Peking University Peter C. B. Phillips Cowles Foundation, Yale University, University of Auckland & University

More information

Ch.10 Autocorrelated Disturbances (June 15, 2016)

Ch.10 Autocorrelated Disturbances (June 15, 2016) Ch10 Autocorrelated Disturbances (June 15, 2016) In a time-series linear regression model setting, Y t = x tβ + u t, t = 1, 2,, T, (10-1) a common problem is autocorrelation, or serial correlation of the

More information

The Bayesian Approach to Multi-equation Econometric Model Estimation

The Bayesian Approach to Multi-equation Econometric Model Estimation Journal of Statistical and Econometric Methods, vol.3, no.1, 2014, 85-96 ISSN: 2241-0384 (print), 2241-0376 (online) Scienpress Ltd, 2014 The Bayesian Approach to Multi-equation Econometric Model Estimation

More information

Diagnostic Test for GARCH Models Based on Absolute Residual Autocorrelations

Diagnostic Test for GARCH Models Based on Absolute Residual Autocorrelations Diagnostic Test for GARCH Models Based on Absolute Residual Autocorrelations Farhat Iqbal Department of Statistics, University of Balochistan Quetta-Pakistan farhatiqb@gmail.com Abstract In this paper

More information

GMM and SMM. 1. Hansen, L Large Sample Properties of Generalized Method of Moments Estimators, Econometrica, 50, p

GMM and SMM. 1. Hansen, L Large Sample Properties of Generalized Method of Moments Estimators, Econometrica, 50, p GMM and SMM Some useful references: 1. Hansen, L. 1982. Large Sample Properties of Generalized Method of Moments Estimators, Econometrica, 50, p. 1029-54. 2. Lee, B.S. and B. Ingram. 1991 Simulation estimation

More information

Testing Overidentifying Restrictions with Many Instruments and Heteroskedasticity

Testing Overidentifying Restrictions with Many Instruments and Heteroskedasticity Testing Overidentifying Restrictions with Many Instruments and Heteroskedasticity John C. Chao, Department of Economics, University of Maryland, chao@econ.umd.edu. Jerry A. Hausman, Department of Economics,

More information

1/34 3/ Omission of a relevant variable(s) Y i = α 1 + α 2 X 1i + α 3 X 2i + u 2i

1/34 3/ Omission of a relevant variable(s) Y i = α 1 + α 2 X 1i + α 3 X 2i + u 2i 1/34 Outline Basic Econometrics in Transportation Model Specification How does one go about finding the correct model? What are the consequences of specification errors? How does one detect specification

More information

1 Procedures robust to weak instruments

1 Procedures robust to weak instruments Comment on Weak instrument robust tests in GMM and the new Keynesian Phillips curve By Anna Mikusheva We are witnessing a growing awareness among applied researchers about the possibility of having weak

More information

Cointegration Lecture I: Introduction

Cointegration Lecture I: Introduction 1 Cointegration Lecture I: Introduction Julia Giese Nuffield College julia.giese@economics.ox.ac.uk Hilary Term 2008 2 Outline Introduction Estimation of unrestricted VAR Non-stationarity Deterministic

More information

Lecture 5: Unit Roots, Cointegration and Error Correction Models The Spurious Regression Problem

Lecture 5: Unit Roots, Cointegration and Error Correction Models The Spurious Regression Problem Lecture 5: Unit Roots, Cointegration and Error Correction Models The Spurious Regression Problem Prof. Massimo Guidolin 20192 Financial Econometrics Winter/Spring 2018 Overview Stochastic vs. deterministic

More information

HAC Corrections for Strongly Autocorrelated Time Series

HAC Corrections for Strongly Autocorrelated Time Series HAC Corrections for Strongly Autocorrelated Time Series Ulrich K. Müller Princeton University Department of Economics Princeton, NJ, 08544 December 2012 Abstract Applied work routinely relies on heteroskedasticity

More information

Approximate analysis of covariance in trials in rare diseases, in particular rare cancers

Approximate analysis of covariance in trials in rare diseases, in particular rare cancers Approximate analysis of covariance in trials in rare diseases, in particular rare cancers Stephen Senn (c) Stephen Senn 1 Acknowledgements This work is partly supported by the European Union s 7th Framework

More information

Modified Variance Ratio Test for Autocorrelation in the Presence of Heteroskedasticity

Modified Variance Ratio Test for Autocorrelation in the Presence of Heteroskedasticity The Lahore Journal of Economics 23 : 1 (Summer 2018): pp. 1 19 Modified Variance Ratio Test for Autocorrelation in the Presence of Heteroskedasticity Sohail Chand * and Nuzhat Aftab ** Abstract Given that

More information

Steven Cook University of Wales Swansea. Abstract

Steven Cook University of Wales Swansea. Abstract On the finite sample power of modified Dickey Fuller tests: The role of the initial condition Steven Cook University of Wales Swansea Abstract The relationship between the initial condition of time series

More information

Økonomisk Kandidateksamen 2004 (I) Econometrics 2

Økonomisk Kandidateksamen 2004 (I) Econometrics 2 Økonomisk Kandidateksamen 2004 (I) Econometrics 2 This is a closed-book exam (uden hjælpemidler). Answer all questions! The group of questions 1 to 4 have equal weight. Within each group, part (a) represents

More information

Econometrics II. Nonstandard Standard Error Issues: A Guide for the. Practitioner

Econometrics II. Nonstandard Standard Error Issues: A Guide for the. Practitioner Econometrics II Nonstandard Standard Error Issues: A Guide for the Practitioner Måns Söderbom 10 May 2011 Department of Economics, University of Gothenburg. Email: mans.soderbom@economics.gu.se. Web: www.economics.gu.se/soderbom,

More information

Introductory Econometrics

Introductory Econometrics Based on the textbook by Wooldridge: : A Modern Approach Robert M. Kunst robert.kunst@univie.ac.at University of Vienna and Institute for Advanced Studies Vienna November 23, 2013 Outline Introduction

More information

ACE 564 Spring Lecture 8. Violations of Basic Assumptions I: Multicollinearity and Non-Sample Information. by Professor Scott H.

ACE 564 Spring Lecture 8. Violations of Basic Assumptions I: Multicollinearity and Non-Sample Information. by Professor Scott H. ACE 564 Spring 2006 Lecture 8 Violations of Basic Assumptions I: Multicollinearity and Non-Sample Information by Professor Scott H. Irwin Readings: Griffiths, Hill and Judge. "Collinear Economic Variables,

More information

Econometrics of Panel Data

Econometrics of Panel Data Econometrics of Panel Data Jakub Mućk Meeting # 6 Jakub Mućk Econometrics of Panel Data Meeting # 6 1 / 36 Outline 1 The First-Difference (FD) estimator 2 Dynamic panel data models 3 The Anderson and Hsiao

More information

Testing in GMM Models Without Truncation

Testing in GMM Models Without Truncation Testing in GMM Models Without Truncation TimothyJ.Vogelsang Departments of Economics and Statistical Science, Cornell University First Version August, 000; This Version June, 001 Abstract This paper proposes

More information

Econometric Analysis of Cross Section and Panel Data

Econometric Analysis of Cross Section and Panel Data Econometric Analysis of Cross Section and Panel Data Jeffrey M. Wooldridge / The MIT Press Cambridge, Massachusetts London, England Contents Preface Acknowledgments xvii xxiii I INTRODUCTION AND BACKGROUND

More information

INTRODUCTION TO BASIC LINEAR REGRESSION MODEL

INTRODUCTION TO BASIC LINEAR REGRESSION MODEL INTRODUCTION TO BASIC LINEAR REGRESSION MODEL 13 September 2011 Yogyakarta, Indonesia Cosimo Beverelli (World Trade Organization) 1 LINEAR REGRESSION MODEL In general, regression models estimate the effect

More information

A Course in Applied Econometrics Lecture 18: Missing Data. Jeff Wooldridge IRP Lectures, UW Madison, August Linear model with IVs: y i x i u i,

A Course in Applied Econometrics Lecture 18: Missing Data. Jeff Wooldridge IRP Lectures, UW Madison, August Linear model with IVs: y i x i u i, A Course in Applied Econometrics Lecture 18: Missing Data Jeff Wooldridge IRP Lectures, UW Madison, August 2008 1. When Can Missing Data be Ignored? 2. Inverse Probability Weighting 3. Imputation 4. Heckman-Type

More information

LECTURE ON HAC COVARIANCE MATRIX ESTIMATION AND THE KVB APPROACH

LECTURE ON HAC COVARIANCE MATRIX ESTIMATION AND THE KVB APPROACH LECURE ON HAC COVARIANCE MARIX ESIMAION AND HE KVB APPROACH CHUNG-MING KUAN Institute of Economics Academia Sinica October 20, 2006 ckuan@econ.sinica.edu.tw www.sinica.edu.tw/ ckuan Outline C.-M. Kuan,

More information

Econometrics Honor s Exam Review Session. Spring 2012 Eunice Han

Econometrics Honor s Exam Review Session. Spring 2012 Eunice Han Econometrics Honor s Exam Review Session Spring 2012 Eunice Han Topics 1. OLS The Assumptions Omitted Variable Bias Conditional Mean Independence Hypothesis Testing and Confidence Intervals Homoskedasticity

More information

ECON2228 Notes 10. Christopher F Baum. Boston College Economics. cfb (BC Econ) ECON2228 Notes / 48

ECON2228 Notes 10. Christopher F Baum. Boston College Economics. cfb (BC Econ) ECON2228 Notes / 48 ECON2228 Notes 10 Christopher F Baum Boston College Economics 2014 2015 cfb (BC Econ) ECON2228 Notes 10 2014 2015 1 / 48 Serial correlation and heteroskedasticity in time series regressions Chapter 12:

More information

Empirical Economic Research, Part II

Empirical Economic Research, Part II Based on the text book by Ramanathan: Introductory Econometrics Robert M. Kunst robert.kunst@univie.ac.at University of Vienna and Institute for Advanced Studies Vienna December 7, 2011 Outline Introduction

More information

ECON3327: Financial Econometrics, Spring 2016

ECON3327: Financial Econometrics, Spring 2016 ECON3327: Financial Econometrics, Spring 2016 Wooldridge, Introductory Econometrics (5th ed, 2012) Chapter 11: OLS with time series data Stationary and weakly dependent time series The notion of a stationary

More information

So far our focus has been on estimation of the parameter vector β in the. y = Xβ + u

So far our focus has been on estimation of the parameter vector β in the. y = Xβ + u Interval estimation and hypothesis tests So far our focus has been on estimation of the parameter vector β in the linear model y i = β 1 x 1i + β 2 x 2i +... + β K x Ki + u i = x iβ + u i for i = 1, 2,...,

More information

Wooldridge, Introductory Econometrics, 4th ed. Appendix C: Fundamentals of mathematical statistics

Wooldridge, Introductory Econometrics, 4th ed. Appendix C: Fundamentals of mathematical statistics Wooldridge, Introductory Econometrics, 4th ed. Appendix C: Fundamentals of mathematical statistics A short review of the principles of mathematical statistics (or, what you should have learned in EC 151).

More information

Darmstadt Discussion Papers in Economics

Darmstadt Discussion Papers in Economics Darmstadt Discussion Papers in Economics The Effect of Linear Time Trends on Cointegration Testing in Single Equations Uwe Hassler Nr. 111 Arbeitspapiere des Instituts für Volkswirtschaftslehre Technische

More information

HANDBOOK OF APPLICABLE MATHEMATICS

HANDBOOK OF APPLICABLE MATHEMATICS HANDBOOK OF APPLICABLE MATHEMATICS Chief Editor: Walter Ledermann Volume VI: Statistics PART A Edited by Emlyn Lloyd University of Lancaster A Wiley-Interscience Publication JOHN WILEY & SONS Chichester

More information

Hypothesis Testing. Part I. James J. Heckman University of Chicago. Econ 312 This draft, April 20, 2006

Hypothesis Testing. Part I. James J. Heckman University of Chicago. Econ 312 This draft, April 20, 2006 Hypothesis Testing Part I James J. Heckman University of Chicago Econ 312 This draft, April 20, 2006 1 1 A Brief Review of Hypothesis Testing and Its Uses values and pure significance tests (R.A. Fisher)

More information

Vector Autoregressive Model. Vector Autoregressions II. Estimation of Vector Autoregressions II. Estimation of Vector Autoregressions I.

Vector Autoregressive Model. Vector Autoregressions II. Estimation of Vector Autoregressions II. Estimation of Vector Autoregressions I. Vector Autoregressive Model Vector Autoregressions II Empirical Macroeconomics - Lect 2 Dr. Ana Beatriz Galvao Queen Mary University of London January 2012 A VAR(p) model of the m 1 vector of time series

More information

8. Instrumental variables regression

8. Instrumental variables regression 8. Instrumental variables regression Recall: In Section 5 we analyzed five sources of estimation bias arising because the regressor is correlated with the error term Violation of the first OLS assumption

More information

GENERALISED LEAST SQUARES AND RELATED TOPICS

GENERALISED LEAST SQUARES AND RELATED TOPICS GENERALISED LEAST SQUARES AND RELATED TOPICS Haris Psaradakis Birkbeck, University of London Nonspherical Errors Consider the model y = Xβ + u, E(u) =0, E(uu 0 )=σ 2 Ω, where Ω is a symmetric and positive

More information

AN EMPIRICAL LIKELIHOOD RATIO TEST FOR NORMALITY

AN EMPIRICAL LIKELIHOOD RATIO TEST FOR NORMALITY Econometrics Working Paper EWP0401 ISSN 1485-6441 Department of Economics AN EMPIRICAL LIKELIHOOD RATIO TEST FOR NORMALITY Lauren Bin Dong & David E. A. Giles Department of Economics, University of Victoria

More information

Least Absolute Value vs. Least Squares Estimation and Inference Procedures in Regression Models with Asymmetric Error Distributions

Least Absolute Value vs. Least Squares Estimation and Inference Procedures in Regression Models with Asymmetric Error Distributions Journal of Modern Applied Statistical Methods Volume 8 Issue 1 Article 13 5-1-2009 Least Absolute Value vs. Least Squares Estimation and Inference Procedures in Regression Models with Asymmetric Error

More information

PART I INTRODUCTION The meaning of probability Basic definitions for frequentist statistics and Bayesian inference Bayesian inference Combinatorics

PART I INTRODUCTION The meaning of probability Basic definitions for frequentist statistics and Bayesian inference Bayesian inference Combinatorics Table of Preface page xi PART I INTRODUCTION 1 1 The meaning of probability 3 1.1 Classical definition of probability 3 1.2 Statistical definition of probability 9 1.3 Bayesian understanding of probability

More information

Linear models. Linear models are computationally convenient and remain widely used in. applied econometric research

Linear models. Linear models are computationally convenient and remain widely used in. applied econometric research Linear models Linear models are computationally convenient and remain widely used in applied econometric research Our main focus in these lectures will be on single equation linear models of the form y

More information

1 The Multiple Regression Model: Freeing Up the Classical Assumptions

1 The Multiple Regression Model: Freeing Up the Classical Assumptions 1 The Multiple Regression Model: Freeing Up the Classical Assumptions Some or all of classical assumptions were crucial for many of the derivations of the previous chapters. Derivation of the OLS estimator

More information

Hypothesis testing Goodness of fit Multicollinearity Prediction. Applied Statistics. Lecturer: Serena Arima

Hypothesis testing Goodness of fit Multicollinearity Prediction. Applied Statistics. Lecturer: Serena Arima Applied Statistics Lecturer: Serena Arima Hypothesis testing for the linear model Under the Gauss-Markov assumptions and the normality of the error terms, we saw that β N(β, σ 2 (X X ) 1 ) and hence s

More information