Heteroskedasticity

We now consider the implications of relaxing the assumption that the conditional variance $V(u_i \mid x_i) = \sigma^2$ is common to all observations $i = 1, \ldots, N$. In many applications, we may suspect that the conditional variance of the error term varies over the observations, in ways that may be difficult to model convincingly.
For example, the variance of shocks to GDP per capita may be quite different for developing countries that are dependent on primary commodity exports compared to developing countries with more diversified export structures, or compared to OECD countries. Or the variance of shocks to firm-level total factor productivity (TFP) may be quite different for firms in high-tech sectors compared to firms in low-tech sectors, or for new entrants compared to incumbent firms.
Allowing the conditional variance $V(u_i \mid x_i) = \sigma_i^2$ to differ across observations $i = 1, \ldots, N$ is referred to as conditional heteroskedasticity. We have seen that the consistency property of the OLS estimator does not require the assumption of conditional homoskedasticity. We now show that the asymptotic normality property of the OLS estimator does not require conditional homoskedasticity either. The asymptotic variance of the OLS estimator has a different form in the more general case of conditional heteroskedasticity, but can still be estimated consistently.
This allows us to extend the asymptotic Wald tests of restrictions on the parameter vector $\beta$ to this more general setting. As before, we start with a sufficient set of assumptions to obtain the asymptotic normality result, and to derive the asymptotic variance of the OLS estimator, in the case of conditional heteroskedasticity.
(i) $y_i = x_i'\beta + u_i$ for $i = 1, \ldots, N$, or $y = X\beta + u$

(ii) The data on $(y_i, x_i)$ are independent over $i = 1, \ldots, N$, with $E(u_i) = 0$ and $E(x_i u_i) = 0$ for all $i = 1, \ldots, N$, but now with $E(u_i^2 \mid x_i) = \sigma_i^2$

(iii) $X$ is stochastic and full rank

(iv) The $K \times K$ matrix $M_{XX} = \operatorname{plim} \left( \frac{X'X}{N} \right) = \operatorname{plim} \frac{1}{N} \sum_{i=1}^{N} x_i x_i'$ exists and is non-singular

(v) The $K \times 1$ vector $\frac{1}{\sqrt{N}} X'u = \frac{1}{\sqrt{N}} \sum_{i=1}^{N} x_i u_i \xrightarrow{D} N(0, M_{X\Omega X})$, where $M_{X\Omega X} = \operatorname{plim} \left( \frac{X'uu'X}{N} \right) = \operatorname{plim} \frac{1}{N} \sum_{i=1}^{N} u_i^2 x_i x_i'$

Then
$$\sqrt{N}(\widehat{\beta}_{OLS} - \beta) \xrightarrow{D} N\left(0,\; M_{XX}^{-1} M_{X\Omega X} M_{XX}^{-1}\right)$$
Assumption (ii), that $E(u_i^2 \mid x_i) = \sigma_i^2$ and we have independent observations over $i = 1, \ldots, N$, implies that the conditional variance matrix $E(uu' \mid X) = \Omega$ is an $N \times N$ matrix with elements $\sigma_i^2$ on its main diagonal, and zeros elsewhere, i.e.
$$E(uu' \mid X) = \Omega = \begin{pmatrix} \sigma_1^2 & 0 & \cdots & 0 \\ 0 & \sigma_2^2 & \cdots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \cdots & \sigma_N^2 \end{pmatrix}$$
Note that there are $N$ parameters $\sigma_1^2, \sigma_2^2, \ldots, \sigma_N^2$ in $\Omega$.
Consequently we cannot estimate $\Omega$ consistently from a sample with $N$ observations: the number of parameters to be estimated increases at the same rate as the sample size. Happily we do not require a consistent estimator of $\Omega$ to obtain a consistent estimator of
$$\operatorname{avar}(\widehat{\beta}_{OLS}) = \left( \frac{1}{N} \right) M_{XX}^{-1} M_{X\Omega X} M_{XX}^{-1}$$
The proof of this asymptotic normality result follows the same steps that we went through in detail for the case of conditional homoskedasticity. We write
$$\sqrt{N}(\widehat{\beta}_{OLS} - \beta) = \left( \frac{X'X}{N} \right)^{-1} \left( \frac{X'u}{\sqrt{N}} \right)$$
The $K \times K$ matrix $\left( \frac{X'X}{N} \right)^{-1} \xrightarrow{P} M_{XX}^{-1}$ from assumption (iv). The $K \times 1$ vector $\left( \frac{X'u}{\sqrt{N}} \right) \xrightarrow{D} N(0, M_{X\Omega X})$ from assumption (v). The product rule then implies that $\left( \frac{X'X}{N} \right)^{-1} \left( \frac{X'u}{\sqrt{N}} \right)$ has the same limit distribution as $M_{XX}^{-1} \left( \frac{X'u}{\sqrt{N}} \right)$, so that
$$\sqrt{N}(\widehat{\beta}_{OLS} - \beta) \xrightarrow{D} N\left(0,\; M_{XX}^{-1} M_{X\Omega X} M_{XX}^{-1}\right)$$
using the symmetry of $M_{XX}^{-1}$.
As before, assumption (v) can be derived from more primitive assumptions, which allow an appropriate (Liapounov) Central Limit Theorem for independent but not identically distributed random vectors to be used to establish that
$$\frac{1}{\sqrt{N}} \sum_{i=1}^{N} x_i u_i \xrightarrow{D} N(0, M_{X\Omega X}) \quad \text{where } M_{X\Omega X} = \operatorname{plim} \frac{1}{N} \sum_{i=1}^{N} u_i^2 x_i x_i'$$
Again this result cannot be applied directly to the case of time series models with lagged dependent variables, since this violates the assumption of independent observations. Section 2.3 of Hayashi (2000) provides a more general asymptotic normality result for the OLS estimator, which covers this case.
The assumption that $(y_i, x_i)$ are independent over $i = 1, \ldots, N$ is replaced by the weaker assumption that the stochastic process $\{y_i, x_i\}$ is stationary and ergodic. A stochastic process $\{z_t\}$ $(t = 1, 2, \ldots)$ is (strictly) stationary if the joint distribution of (e.g.) $(z_1, z_2, z_3)$ is the same as the joint distribution of $(z_{101}, z_{102}, z_{103})$. A stationary stochastic process is ergodic if two random variables $z_t$ and $z_{t+k}$ become (almost) independent as $k$ increases. Note however that for time series models with lagged dependent variables, the assumption that $E(x_t u_t) = 0$ for $t = 1, \ldots, T$ rules out serial correlation in the errors $u_t$.
For example, if we have the simple dynamic model
$$y_t = \beta_0 + \beta_1 y_{t-1} + \beta_2 x_t + u_t$$
then
$$y_{t-1} = \beta_0 + \beta_1 y_{t-2} + \beta_2 x_{t-1} + u_{t-1}$$
So $u_{t-1}$ is certainly correlated with $y_{t-1}$. If $u_t$ is also correlated with $u_{t-1}$, which is what we mean by serial correlation of the error term in this context (e.g. if $u_t = \rho u_{t-1} + e_t$, where $e_t$ is serially uncorrelated), then $y_{t-1}$ and $u_t$ will be correlated, violating the assumption that $E[(1, y_{t-1}, x_t)' u_t] = 0$. In this case, the OLS estimator is inconsistent.
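This inconsistency is easy to see in a small simulation. The sketch below (illustrative only, not from the text; all names are mine) uses the special case $\beta_2 = 0$, so $y_t = \beta_0 + \beta_1 y_{t-1} + u_t$ with AR(1) errors, and shows the OLS slope estimate settling well above the true $\beta_1$ even in a very long sample.

```python
import numpy as np

# Illustrative simulation: y_t = b0 + b1*y_{t-1} + u_t with AR(1) errors
# u_t = rho*u_{t-1} + e_t. OLS of y_t on (1, y_{t-1}) is inconsistent
# because y_{t-1} is correlated with u_t when rho != 0.
rng = np.random.default_rng(0)
T, b0, b1, rho = 50_000, 1.0, 0.5, 0.5

e = rng.standard_normal(T)
u = np.zeros(T)
y = np.zeros(T)
for t in range(1, T):
    u[t] = rho * u[t - 1] + e[t]
    y[t] = b0 + b1 * y[t - 1] + u[t]

# OLS of y_t on a constant and y_{t-1}
X = np.column_stack([np.ones(T - 1), y[:-1]])
b_hat = np.linalg.lstsq(X, y[1:], rcond=None)[0]
print(b_hat[1])  # noticeably above the true b1 = 0.5
```

Increasing T does not remove the gap: the estimate converges to a probability limit that differs from $\beta_1$, which is what inconsistency means here.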
Now consider the result
$$\sqrt{N}(\widehat{\beta}_{OLS} - \beta) \xrightarrow{D} N\left(0,\; M_{XX}^{-1} M_{X\Omega X} M_{XX}^{-1}\right)$$
As before, we can use this limit distribution for $\sqrt{N}(\widehat{\beta}_{OLS} - \beta)$ to obtain an approximation to the distribution of $\widehat{\beta}_{OLS}$ that will be accurate in large (but finite) samples. We have
$$\widehat{\beta}_{OLS} \overset{a}{\sim} N\left(\beta,\; \operatorname{avar}(\widehat{\beta}_{OLS})\right) \quad \text{where } \operatorname{avar}(\widehat{\beta}_{OLS}) = \left( \frac{1}{N} \right) M_{XX}^{-1} M_{X\Omega X} M_{XX}^{-1}$$
To make this useful, we require consistent estimators for the $K \times K$ matrices $M_{XX}$ and $M_{X\Omega X}$. We have already seen that $\widehat{M}_{XX} = \left( \frac{X'X}{N} \right)$ provides a consistent estimator of $M_{XX}$. Similarly $\widehat{M}_{X\Omega X} = \frac{1}{N} \sum_{i=1}^{N} \widehat{u}_i^2 x_i x_i'$ provides a consistent estimator of $M_{X\Omega X}$, since
$$\operatorname{plim} \widehat{M}_{X\Omega X} = \operatorname{plim} \frac{1}{N} \sum_{i=1}^{N} \widehat{u}_i^2 x_i x_i' = \operatorname{plim} \frac{1}{N} \sum_{i=1}^{N} u_i^2 x_i x_i' = M_{X\Omega X}$$
since $\widehat{u}_i = y_i - x_i'\widehat{\beta}_{OLS} \xrightarrow{P} u_i$.
Then
$$\widehat{M}_{XX}^{-1} \widehat{M}_{X\Omega X} \widehat{M}_{XX}^{-1} = \left( \frac{X'X}{N} \right)^{-1} \left( \frac{1}{N} \sum_{i=1}^{N} \widehat{u}_i^2 x_i x_i' \right) \left( \frac{X'X}{N} \right)^{-1}$$
provides a consistent estimator of $M_{XX}^{-1} M_{X\Omega X} M_{XX}^{-1}$. And
$$\widehat{\operatorname{avar}}(\widehat{\beta}_{OLS}) = \left( \frac{1}{N} \right) \widehat{M}_{XX}^{-1} \widehat{M}_{X\Omega X} \widehat{M}_{XX}^{-1} = (X'X)^{-1} \left( \sum_{i=1}^{N} \widehat{u}_i^2 x_i x_i' \right) (X'X)^{-1}$$
provides a consistent estimator of $\operatorname{avar}(\widehat{\beta}_{OLS}) = \left( \frac{1}{N} \right) M_{XX}^{-1} M_{X\Omega X} M_{XX}^{-1}$.
Now we have
$$\widehat{\beta}_{OLS} \overset{a}{\sim} N\left(\beta,\; \widehat{\operatorname{avar}}(\widehat{\beta}_{OLS})\right) \quad \text{where } \widehat{\operatorname{avar}}(\widehat{\beta}_{OLS}) = (X'X)^{-1} \left( \sum_{i=1}^{N} \widehat{u}_i^2 x_i x_i' \right) (X'X)^{-1}$$
We can compute $\widehat{\operatorname{avar}}(\widehat{\beta}_{OLS})$ using the data on $X$ and the OLS residuals $\widehat{u}_i$. We can then construct asymptotic t-test and Wald test statistics as before, using this heteroskedasticity-consistent estimator of $\operatorname{avar}(\widehat{\beta}_{OLS})$ in place of the estimator we obtained in the special case of conditional homoskedasticity.
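The computation above can be sketched directly in a few lines. This is a minimal illustration using numpy only (function and variable names are mine, and the simulated data are hypothetical): it builds the sandwich estimator $(X'X)^{-1} \left( \sum_i \widehat{u}_i^2 x_i x_i' \right) (X'X)^{-1}$ from $X$ and the OLS residuals.

```python
import numpy as np

def hc_robust_avar(X, y):
    """OLS coefficients plus the heteroskedasticity-consistent variance
    estimator (X'X)^{-1} (sum_i uhat_i^2 x_i x_i') (X'X)^{-1}."""
    XtX_inv = np.linalg.inv(X.T @ X)
    beta = XtX_inv @ X.T @ y
    uhat = y - X @ beta
    meat = (X * uhat[:, None] ** 2).T @ X   # sum_i uhat_i^2 x_i x_i'
    return beta, XtX_inv @ meat @ XtX_inv

# Simulated example where the error variance increases with x
rng = np.random.default_rng(1)
n = 2_000
x = rng.uniform(1, 5, n)
X = np.column_stack([np.ones(n), x])
y = 1.0 + 2.0 * x + x * rng.standard_normal(n)  # sd of error = x_i

beta, avar = hc_robust_avar(X, y)
robust_se = np.sqrt(np.diag(avar))
print(beta, robust_se)
```

Robust t-statistics are then simply the coefficient estimates divided by these standard errors.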
This heteroskedasticity-consistent estimator for the asymptotic variance of the OLS estimator was introduced into the econometrics literature in a paper by White (Econometrica, 1980), one of the most cited papers in econometrics (or economics) in the last 30 years. Similar ideas can be found much earlier in the statistics literature, in papers by Huber and by Eicker (both in 1967). The square roots of the elements on the main diagonal of $\widehat{\operatorname{avar}}(\widehat{\beta}_{OLS})$ are variously referred to as heteroskedasticity-consistent standard errors, heteroskedasticity-robust standard errors, or White standard errors (or some combination of Eicker-Huber-White standard errors).
Heteroskedasticity-robust standard errors and test statistics are available in most econometric software. To obtain these in Stata, we can use the vce(robust) option within the regress command, e.g. reg y x1 x2, vce(r). Asymptotic inference based on the non-robust estimator $\widehat{\operatorname{avar}}(\widehat{\beta}_{OLS}) = \widehat{\sigma}^2 (X'X)^{-1}$ that we derived under conditional homoskedasticity is valid only under this restrictive assumption.
But since $E(u_i^2 \mid x_i) = \sigma^2$ is a special case of the more general assumption $E(u_i^2 \mid x_i) = \sigma_i^2$, asymptotic inference based on the robust estimator $\widehat{\operatorname{avar}}(\widehat{\beta}_{OLS}) = (X'X)^{-1} \left( \sum_{i=1}^{N} \widehat{u}_i^2 x_i x_i' \right) (X'X)^{-1}$ that we derived under conditional heteroskedasticity is also valid (in large samples) if the model happens to satisfy conditional homoskedasticity. At no point do we obtain a consistent estimator of the conditional variance matrix $\Omega = E(uu' \mid X)$. All that we need is a consistent estimator of the $K \times K$ matrix $M_{X\Omega X} = \operatorname{plim} \frac{1}{N} \sum_{i=1}^{N} u_i^2 x_i x_i'$, which as we have seen can be estimated consistently as $N \to \infty$ with $K$ fixed.
In applications where large data samples are available, a common response to the suspicion that conditional heteroskedasticity may be relevant is to continue to use the OLS estimator, and to use heteroskedasticity-consistent standard errors (and test statistics) in place of the traditional standard errors (and test statistics). The OLS estimator remains consistent, and we have a consistent estimator of the asymptotic variance matrix, so asymptotic inference remains valid in large samples.
That is, we will reject a correct null hypothesis approximately 5% of the time at the 5% significance level (the level or size of the test is approximately correct). And the probability of rejecting a false null hypothesis (the power of the test) increases with the sample size, tending to one in the limit as $N \to \infty$ (the test is said to be consistent). This is sometimes referred to as a passive response to heteroskedasticity. The OLS estimator is not asymptotically efficient in the case of conditional heteroskedasticity, but can still be used to conduct valid hypothesis tests in large samples.
Testing for heteroskedasticity becomes less important if we have the luxury of using large data samples, and are content to follow this passive strategy. Various tests are available that have power to detect conditional heteroskedasticity in the OLS residuals, or to reject the null hypothesis of conditional homoskedasticity. White (Econometrica, 1980) suggested regressing the squared OLS residuals $\widehat{u}_i^2$ on a constant and on all the explanatory variables, their squares and their cross-products.
For example, in the model with $K = 3$ and an intercept term
$$y_i = \beta_1 + \beta_2 x_{2i} + \beta_3 x_{3i} + u_i$$
we run the regression
$$\widehat{u}_i^2 = \gamma_1 + \gamma_2 x_{2i} + \gamma_3 x_{3i} + \gamma_4 x_{2i}^2 + \gamma_5 x_{3i}^2 + \gamma_6 (x_{2i} x_{3i}) + v_i$$
and test the restriction $H_0: \gamma_2 = \gamma_3 = \ldots = \gamma_6 = 0$ (which is implied by the conditional homoskedasticity assumption $E(u_i^2 \mid x_i) = \sigma^2$ for all $i = 1, \ldots, N$).
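This auxiliary regression is easy to run by hand. The sketch below (illustrative, with simulated data and names of my choosing) uses one common form of the test statistic, $N \cdot R^2$ from the auxiliary regression, which under the null of homoskedasticity is asymptotically chi-squared with degrees of freedom equal to the number of restrictions (here 5).

```python
import numpy as np

rng = np.random.default_rng(2)
n = 1_000
x2, x3 = rng.standard_normal(n), rng.standard_normal(n)
X = np.column_stack([np.ones(n), x2, x3])
y = 1.0 + 0.5 * x2 - 0.5 * x3 + rng.standard_normal(n)  # homoskedastic here

# OLS residuals, squared
beta = np.linalg.lstsq(X, y, rcond=None)[0]
u2 = (y - X @ beta) ** 2

# Auxiliary regression: uhat^2 on levels, squares and the cross-product
Z = np.column_stack([np.ones(n), x2, x3, x2**2, x3**2, x2 * x3])
g = np.linalg.lstsq(Z, u2, rcond=None)[0]
fitted = Z @ g
r2 = 1 - ((u2 - fitted) ** 2).sum() / ((u2 - u2.mean()) ** 2).sum()

# LM form of the statistic: N * R^2, asymptotically chi-squared(5) under H0
lm_stat = n * r2
print(lm_stat)
```

With these homoskedastic simulated errors, the statistic should typically be small relative to the chi-squared(5) critical value, so the null would not be rejected.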
The basic idea is that we specify $\sigma_i^2 = E(u_i^2 \mid x_i)$ to be some unknown function $f(z_i)$ of a vector of observed variables $z_i$. We use the squared residuals $\widehat{u}_i^2$ as a proxy for $u_i^2$, and we use this polynomial approximation to the unknown function $f(z_i)$. Earlier tests for heteroskedasticity based on similar ideas include those proposed by Glejser (JASA, 1969), Ramsey (JRSS(B), 1969), Goldfeld and Quandt (1972) and Breusch and Pagan (Econometrica, 1979).
In some models with conditional heteroskedasticity, we can obtain more efficient estimators than OLS if we are willing to model the form that this heteroskedasticity takes. This active response to heteroskedasticity may be more appropriate in applications where efficiency is considered to be a more important concern. To see the basic idea, we first consider a version of the classical linear regression model with a known form of conditional heteroskedasticity.
Generalized Least Squares

We assume $E(y \mid X) = X\beta$ and $V(y \mid X) = \Omega$, with $\Omega \neq \sigma^2 I$ a known, positive definite conditional variance matrix; $X$ is stochastic and full rank. Because $\Omega$ is positive definite and known, we can find a non-stochastic matrix $H$ such that $H'H = \Omega^{-1}$ and $H\Omega H' = I$.
Let $y^* = Hy$ and $X^* = HX$. Now
$$E(y^* \mid X) = H E(y \mid X) = HX\beta = X^*\beta$$
$$V(y^* \mid X) = H V(y \mid X) H' = H\Omega H' = I \quad (= \sigma^2 I \text{ for } \sigma^2 = 1)$$
and $X^* = HX$ is stochastic and full rank. The transformed model $y^* = X^*\beta + u^*$ with $V(u^* \mid X^*) = I$ is a classical linear regression model with conditional homoskedasticity, which satisfies the assumptions of the Gauss-Markov theorem.
Aitken's theorem: the OLS estimator of $\beta$ in this transformed model, known as the Generalized Least Squares (GLS) estimator, is efficient (in the class of linear, unbiased estimators):
$$\widehat{\beta}_{GLS} = (X^{*\prime} X^*)^{-1} X^{*\prime} y^* = (X'H'HX)^{-1} X'H'Hy = (X'\Omega^{-1}X)^{-1} X'\Omega^{-1} y$$
We can replace $H$ here by any matrix which is proportional to $H$, and still obtain the GLS estimator. If we use $\widetilde{H} = aH$ for some scalar $a$, we have the transformed variables $\widetilde{y} = \widetilde{H}y = aHy$ and $\widetilde{X} = \widetilde{H}X = aHX$, giving
$$(\widetilde{X}'\widetilde{X})^{-1} \widetilde{X}'\widetilde{y} = (a^2 X'H'HX)^{-1} a^2 X'H'Hy = (X'\Omega^{-1}X)^{-1} X'\Omega^{-1} y = \widehat{\beta}_{GLS}$$
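The algebra above can be verified numerically. In this sketch (simulated data; names are mine), $H$ is taken as $\Omega^{-1/2}$ for a diagonal $\Omega$, so that $H'H = \Omega^{-1}$, and OLS on the transformed data reproduces the GLS formula exactly.

```python
import numpy as np

rng = np.random.default_rng(3)
n, k = 200, 3
X = np.column_stack([np.ones(n), rng.standard_normal((n, k - 1))])
sigma2 = rng.uniform(0.5, 4.0, n)          # known heteroskedastic variances
y = X @ np.array([1.0, 2.0, -1.0]) + np.sqrt(sigma2) * rng.standard_normal(n)

# GLS formula: (X' Omega^{-1} X)^{-1} X' Omega^{-1} y, Omega = diag(sigma2)
Omega_inv = np.diag(1.0 / sigma2)
b_gls = np.linalg.solve(X.T @ Omega_inv @ X, X.T @ Omega_inv @ y)

# OLS on the transformed model with H = Omega^{-1/2}, so H'H = Omega^{-1}
H = np.diag(1.0 / np.sqrt(sigma2))
b_transformed = np.linalg.lstsq(H @ X, H @ y, rcond=None)[0]

print(np.allclose(b_gls, b_transformed))  # the two routes coincide
```

Replacing H by aH for any nonzero scalar a leaves b_transformed unchanged, as the derivation shows.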
Under the further normality assumption $y \mid X \sim N(X\beta, \Omega)$, the GLS estimator is also the conditional Maximum Likelihood estimator in this particular model where $\Omega$ is known. $\widehat{\beta}_{GLS}$ also has a normal distribution in this case, with
$$\widehat{\beta}_{GLS} \mid X \sim N\left(\beta,\; (X'\Omega^{-1}X)^{-1}\right)$$
In practice this GLS estimator cannot be computed, since we don't know the conditional variance matrix $\Omega$. The Feasible Generalized Least Squares (FGLS) estimator replaces the unknown $\Omega$ by an estimator $\widehat{\Omega}$, giving
$$\widehat{\beta}_{FGLS} = (X'\widehat{\Omega}^{-1}X)^{-1} X'\widehat{\Omega}^{-1} y$$
This can be computed using $\widehat{y}^* = \widehat{H}y$ and $\widehat{X}^* = \widehat{H}X$, where we now require $\widehat{H}'\widehat{H} = \widehat{\Omega}^{-1}$ and $\widehat{H}\widehat{\Omega}\widehat{H}' = I$ [or we can use any matrix that is proportional to $\widehat{H}$].
The properties of the FGLS estimator depend on the properties of $\widehat{\Omega}$ as an estimator of $\Omega$. If $\widehat{\Omega}$ is a consistent estimator of $\Omega$, then under quite general conditions we find that $\widehat{\beta}_{FGLS}$ has the same asymptotic distribution as the infeasible $\widehat{\beta}_{GLS}$, giving
$$\widehat{\beta}_{FGLS} \overset{a}{\sim} N\left(\beta,\; (X'\Omega^{-1}X)^{-1}\right)$$
In this case, $\widehat{\beta}_{FGLS}$ is also asymptotically efficient. However, obtaining a consistent estimator of $\Omega$ is not straightforward.
In general, the symmetric $N \times N$ matrix $\Omega$ has $N(N+1)/2$ distinct elements. Even if we restrict all the off-diagonal elements to be zero, as is natural in a cross-section regression context with independent observations, so that
$$\Omega = \begin{pmatrix} \sigma_1^2 & 0 & \cdots & 0 \\ 0 & \sigma_2^2 & \cdots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \cdots & \sigma_N^2 \end{pmatrix}$$
we still have $N$ distinct elements. These cannot be estimated consistently from a sample of size $N$.
Consistent estimation requires us to specify a (parametric) model for $\Omega$ of the form $\Omega = \Omega(\phi)$, where $\Omega(\phi)$ is a function of the vector $\phi$, which contains a finite number of additional parameters, not increasing with the sample size, which can be estimated consistently from the data. If this specification of the conditional heteroskedasticity $V(y \mid X) = \Omega(\phi)$ is correct, and we can find a consistent estimator $\widehat{\phi}$ for $\phi$, we can then use the consistent estimator $\widehat{\Omega} = \Omega(\widehat{\phi})$ to obtain the FGLS estimator.
As a very simple example (in which implementation does not require the estimation of any additional parameters), we could specify the conditional variance $V(y_i \mid X) = V(u_i \mid X) = \sigma_i^2$ to be proportional to the squared values of one of the regressors, say $x_{Ki}$, giving $\sigma_i^2 = \sigma^2 x_{Ki}^2$. We then let $y_i^* = y_i / x_{Ki}$ and $x_{ki}^* = x_{ki} / x_{Ki}$ for each $k = 1, \ldots, K$ (notice how this transformation affects the intercept in the model). In the transformed model $y_i^* = x_i^{*\prime}\beta + u_i^*$ we then have
$$V(y_i^* \mid X) = V(u_i^* \mid X) = \frac{\sigma_i^2}{x_{Ki}^2} = \frac{\sigma^2 x_{Ki}^2}{x_{Ki}^2} = \sigma^2 \quad \text{for all } i = 1, \ldots, N$$
This transformed model then satisfies conditional homoskedasticity, and we can compute the FGLS estimator here simply as the OLS estimator in the transformed model. Feasible GLS estimators of this kind are also known as Weighted Least Squares estimators, since we weight each observation by a factor which is proportional to (an estimate of) $1/\sigma_i$. Note that the transformation gives less weight to observations where the variance of $u_i$ is (estimated to be) relatively high, and more weight to observations where the variance of $u_i$ is (estimated to be) relatively low.
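The simple example above can be sketched as follows (simulated data; names are mine): every variable, including the intercept column, is divided by $x_{Ki}$, and OLS on the transformed data coincides with the GLS formula using $\Omega = \operatorname{diag}(x_{Ki}^2)$.

```python
import numpy as np

rng = np.random.default_rng(4)
n = 500
xK = rng.uniform(1, 3, n)                          # regressor driving the variance
X = np.column_stack([np.ones(n), xK])
y = 2.0 + 1.5 * xK + xK * rng.standard_normal(n)   # V(u_i) proportional to xK_i^2

# Divide every variable (including the intercept column) by xK_i
Xs = X / xK[:, None]       # the intercept column becomes 1/xK_i
ys = y / xK
b_wls = np.linalg.lstsq(Xs, ys, rcond=None)[0]

# Same estimator via the GLS formula with Omega = diag(xK_i^2)
w = 1.0 / xK**2
b_gls = np.linalg.solve((X * w[:, None]).T @ X, (X * w[:, None]).T @ y)

print(np.allclose(b_wls, b_gls))  # weighted OLS equals GLS here
```

Note how the transformation turns the original intercept into a coefficient on $1/x_{Ki}$ and the original slope into the "intercept" of the transformed regression, exactly as the text's warning about the intercept suggests.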
If our specification for the conditional variance $V(y \mid X) = \Omega(\phi)$ is correct, and we estimate $\Omega(\phi)$ consistently, this weighting is the source of the efficiency gain compared to OLS.
If we specify the conditional variance $V(y \mid X) = \Omega = \Omega(\phi)$ and further assume that $y \mid X \sim N(X\beta, \Omega(\phi))$, then the feasible GLS estimator is not the conditional Maximum Likelihood estimator in the case where $\phi$ is unknown. FGLS uses a consistent estimator of $\phi$ to construct a consistent estimator of $\Omega$, and then maximizes $L(\beta, \widehat{\Omega}) = L(\beta, \Omega(\widehat{\phi}))$ with respect to $\beta$. The conditional Maximum Likelihood estimator maximizes the likelihood function $L(\beta, \Omega) = L(\beta, \Omega(\phi))$ with respect to $\beta$ and $\phi$ jointly.
These estimators are different, unless we happen to have $\widehat{\phi} = \widehat{\phi}_{ML}$, giving $\widehat{\Omega} = \widehat{\Omega}_{ML}$, in which case $L(\beta, \widehat{\Omega}_{ML})$ is a concentrated likelihood function, and maximizing $L(\beta, \widehat{\Omega}_{ML})$ with respect to $\beta$ also yields the conditional Maximum Likelihood estimator $\widehat{\beta}_{ML}$. In most applications of Feasible GLS, we do not have $\widehat{\phi} = \widehat{\phi}_{ML}$. Then $\widehat{\beta}_{FGLS} \neq \widehat{\beta}_{ML}$, although they are asymptotically equivalent (i.e. they have the same asymptotic distribution) under quite general conditions.
The consistency properties of $\widehat{\beta}_{FGLS}$ and $\widehat{\beta}_{ML}$ in this linear model do not depend on the parametric specification of $\Omega = \Omega(\phi)$ being correct. But the efficiency advantages of these estimators relative to $\widehat{\beta}_{OLS}$ may not hold if this specification for the form of the conditional heteroskedasticity is not correct. These estimators do not extend straightforwardly to linear models that do not satisfy the linear conditional expectation assumption $E(y_i \mid x_i) = x_i'\beta$.