Topic 5: Extensions to the Basic Framework I
ARE/ECN 240A Graduate Econometrics
Professor: Òscar Jordà
Outline of this topic
- Heteroskedasticity: reminder of OLS results and White (1980) corrected standard errors; derivation of GLS/GMM results; testing.
- Autocorrelation: brief introduction to time series; derivation of the AR(1) GLS correction; Newey-West standard errors.
Recall: Asymptotic Distribution of the OLS Estimator under Homoskedasticity

Recall: $\hat\beta - \beta = (X'X)^{-1}X'\epsilon$, so that
$$\sqrt{n}(\hat\beta - \beta) = \left(\frac{X'X}{n}\right)^{-1} \frac{X'\epsilon}{\sqrt{n}}$$

We know: $\operatorname{plim}\, \frac{X'X}{n} = Q$ and $\frac{X'\epsilon}{\sqrt{n}} \xrightarrow{d} N(0, \Omega)$, where
$$\Omega = V\!\left(\frac{X'\epsilon}{\sqrt{n}}\right) = \frac{1}{n} E\big[E(X'\epsilon\epsilon'X \mid X)\big] = \sigma^2 E\!\left(\frac{X'X}{n}\right) = \sigma^2 Q$$

By Slutsky's theorem:
$$\sqrt{n}(\hat\beta - \beta) \xrightarrow{d} N(0, \sigma^2 Q^{-1})$$
Remarks

To derive this fundamental result we assumed an i.i.d. sample. However, we will see that this can be relaxed in several ways:
1. To data that are not identically distributed (but independent): useful when heteroskedasticity is suspected.
2. To data that are identically distributed but not independent (useful for time series data, but we need to limit the amount of dependence).
3. To data that are neither identically distributed nor independent (which is the safest assumption with time series data).

We also used the assumption that
$$\Omega = V\!\left(\frac{X'\epsilon}{\sqrt{n}}\right) = \frac{1}{n} E\big[E(X'\epsilon\epsilon'X \mid X)\big] = \sigma^2 E\!\left(\frac{X'X}{n}\right) = \sigma^2 Q$$
A More Formal Statement for Asymptotic Normality with Heteroskedastic Errors

Assume:
1. $X$ contains an intercept
2. $E(y^2) < \infty$
3. $E(x_j^2) < \infty$ for $j = 1, \ldots, K$
4. $Q = E(X_i'X_i)$ is invertible
5. $E(\epsilon_i^4) < \infty$ and $E(x_j^4) < \infty$ for $j = 1, \ldots, K$
6. $\{y_i, x_{1i}, \ldots, x_{Ki}\}_{i=1}^n$ is i.i.d.

Then as $n \to \infty$:
$$\sqrt{n}(\hat\beta - \beta) \xrightarrow{d} N(0, V), \quad \text{where } V = Q^{-1}\Omega Q^{-1}$$
Specifics

Using slightly more detailed notation than before, notice that
$$\sqrt{n}(\hat\beta - \beta) = \left(\frac{1}{n}\sum_{i=1}^n X_i'X_i\right)^{-1} \frac{1}{\sqrt{n}}\sum_{i=1}^n X_i'\epsilon_i$$
Notice that $X_i$ is a $1 \times K$ vector for $i = 1, \ldots, n$ observations.

We had established that, under the assumptions, a LLN applies:
$$\frac{1}{n}\sum_{i=1}^n X_i'X_i \xrightarrow{p} E(X_i'X_i) = Q$$

Next we use a CLT and Slutsky's theorem.
Specifics (cont.)

Under the assumptions, the CLT means that
$$\frac{1}{\sqrt{n}}\sum_{i=1}^n X_i'\epsilon_i \xrightarrow{d} N(0, \Omega), \quad \text{where } \Omega = E(X_i'X_i\epsilon_i^2)$$

Under homoskedasticity, we saw that when $X_i$ and $\epsilon_i$ are independent, then
$$\Omega = E(X_i'X_i)E(\epsilon_i^2) = Q\sigma^2 \quad \Rightarrow \quad V = Q^{-1}\Omega Q^{-1} = Q^{-1}Q\sigma^2 Q^{-1} = \sigma^2 Q^{-1}$$

But in general, this will not hold.
Heteroskedasticity

Instead, White (1980) proposed his very famous estimator
$$\hat\Omega = \frac{1}{n}\sum_{i=1}^n X_i'X_i\hat\epsilon_i^2$$
so that the estimate of the variance-covariance matrix of $\hat\beta$ is the sandwich estimator
$$\hat V = \hat Q^{-1}\hat\Omega\hat Q^{-1}$$

Formally, one would still need to show that $\hat Q \xrightarrow{p} Q$, $\hat\Omega \xrightarrow{p} \Omega$, and hence $\hat V \xrightarrow{p} V$.
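The sandwich formula above can be computed in a few lines. A minimal numerical sketch in Python rather than GAUSS/Stata; the data-generating process and all variable names are hypothetical choices for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 5000
x = rng.normal(size=n)
X = np.column_stack([np.ones(n), x])          # intercept + one regressor
eps = rng.normal(size=n) * np.exp(0.5 * x)    # heteroskedastic errors
y = X @ np.array([1.0, 2.0]) + eps

beta_hat = np.linalg.solve(X.T @ X, X.T @ y)  # OLS
e = y - X @ beta_hat                          # residuals

Q_hat = X.T @ X / n                               # (1/n) sum X_i' X_i
Omega_hat = (X * e[:, None] ** 2).T @ X / n       # (1/n) sum X_i' X_i e_i^2  (White, 1980)
Q_inv = np.linalg.inv(Q_hat)
V_hat = Q_inv @ Omega_hat @ Q_inv                 # sandwich: Q^{-1} Omega Q^{-1}
se_white = np.sqrt(np.diag(V_hat) / n)            # heteroskedasticity-robust SEs

# naive (homoskedastic) SEs, for comparison
se_ols = np.sqrt(np.diag(np.linalg.inv(X.T @ X)) * (e @ e) / (n - 2))
```

With errors whose variance depends on the regressor, the robust and naive standard errors differ while the point estimates coincide, exactly as the slides emphasize.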
Heteroskedasticity: Where Does It Come From?

Think of the relation between food expenditures and income. Wealthier individuals do not necessarily consume more calories, on average, than poorer folk. But they can certainly spend more on how they obtain those calories: think of a Big Mac versus filet mignon. However, not all wealthy individuals develop a taste for expensive food (think of a computer programmer at Google, recently graduated from college, living on pasta with Ragú sauce and pizza). So the dispersion of food spending widens with income.
A Picture

[Figure: Food Expenditures plotted against Total Income]
Also, recall our own data. The homoskedasticity assumption is often violated. However, it is easy to relax: doing so will not affect the parameter estimates, but it will affect how their standard errors are calculated (i.e., the efficiency of the estimator).

. sum testscr if str <= 17
    Variable |  Obs      Mean   Std. Dev.      Min      Max
     testscr |   36  660.4139   23.03868   618.05    704.3

. sum testscr if str > 17 & str <= 20
    Variable |  Obs      Mean   Std. Dev.      Min      Max
     testscr |  207  656.6229   18.56458   606.75   706.75

. sum testscr if str >= 22.8
    Variable |  Obs      Mean   Std. Dev.      Min      Max
     testscr |   19  647.0395   16.94368   622.05   676.85

. sum testscr if str < 22.8 & str >= 19.8
    Variable |  Obs      Mean   Std. Dev.      Min      Max
     testscr |  184  650.3753   17.83076   605.55    694.8
Recap

White's (1980) correction gives appropriate standard errors, but is there a way to improve efficiency?

To discuss how to improve the efficiency of regression estimators, it is helpful to remember the theory of extremum estimation. Recall that extremum estimator theory encompasses the majority of estimation procedures commonly used in econometrics (OLS, MLE, and GMM).

Here are some basic results we have already seen and which help motivate GLS.
Basic Theorem for Extremum Estimators

Consider an extremum estimator $\hat\theta$ obtained from an objective function $\hat Q_n(\theta)$:
$$\hat\theta = \arg\max_{\theta \in \Theta} \hat Q_n(\theta)$$

The method of moments:
$$\hat Q_n(\theta) = -\left[\frac{1}{n}\sum_{i=1}^n g(y_i, X_i; \theta)\right]' \hat W \left[\frac{1}{n}\sum_{i=1}^n g(y_i, X_i; \theta)\right]$$
where $E(g(y_i, X_i; \theta)) = 0$ and $\hat W$ is a weighting matrix.

For example, in deriving the estimator for linear regression we used
$$E(X'\epsilon) = E(X'(y - X\beta)) = 0$$
hence $g(\cdot) = X_i'(y_i - X_i\beta)$ and $\hat W = I$.
Basic Asymptotic Normality Theorem

Suppose: $\max_{\theta \in \Theta}\hat Q_n(\theta) = \hat Q_n(\hat\theta)$, $\hat\theta \xrightarrow{p} \theta_0$, and
(i) $\theta_0 \in \operatorname{interior}(\Theta)$;
(ii) $\hat Q_n(\theta)$ is twice continuously differentiable in a neighborhood $N$ of $\theta_0$;
(iii) $\sqrt{n}\,\nabla_\theta \hat Q_n(\theta_0) \xrightarrow{d} N(0, \Omega)$;
(iv) there is $H(\theta)$ that is continuous at $\theta_0$ with $\sup_{\theta \in N} \|\nabla_{\theta\theta}\hat Q_n(\theta) - H(\theta)\| \xrightarrow{p} 0$;
(v) $H = H(\theta_0)$ nonsingular.

Then:
$$\sqrt{n}(\hat\theta - \theta_0) \xrightarrow{d} N(0, H^{-1}\Omega H^{-1})$$
Sketch of the Proof

Conditions (i)-(iii) imply that $\nabla_\theta \hat Q_n(\hat\theta) = 0$ w.p. 1. Expand this condition around $\theta_0$ using the mean-value theorem:
$$\nabla_\theta \hat Q_n(\hat\theta) = \nabla_\theta \hat Q_n(\theta_0) + \hat H(\bar\theta)(\hat\theta - \theta_0)$$

We know: $\nabla_\theta \hat Q_n(\hat\theta) = 0$ and, by (iii), $\sqrt{n}\,\nabla_\theta \hat Q_n(\theta_0) \xrightarrow{d} N(0, \Omega)$.

Since $\bar\theta \in [\hat\theta, \theta_0]$ and $\hat\theta \xrightarrow{p} \theta_0$, then $\bar\theta \xrightarrow{p} \theta_0$. By (iv) we can show (after a few intermediate steps) that
$$\|\hat H(\bar\theta) - H(\theta_0)\| \xrightarrow{p} 0$$

Hence:
$$\sqrt{n}(\hat\theta - \theta_0) = -\hat H(\bar\theta)^{-1}\sqrt{n}\,\nabla_\theta \hat Q_n(\theta_0) \xrightarrow{d} N(0, H^{-1}\Omega H^{-1})$$
An Application to GMM for Linear Regression: A Preview of GLS

Use the moment condition $E(X'\epsilon) = 0$. Hence:
$$\bar g(y_i, X_i; \beta) = \frac{1}{n}\sum_{i=1}^n X_i'\epsilon_i = \frac{1}{n}\sum_{i=1}^n X_i'(y_i - X_i\beta)$$

First order conditions for $\hat Q_n(\beta)$:
$$G(y_i, X_i; \hat\beta)'\,\hat W\, \bar g(y_i, X_i; \hat\beta) = 0$$
that is,
$$\left(\frac{1}{n}\sum_{i=1}^n X_i'X_i\right) \hat W \left(\frac{1}{n}\sum_{i=1}^n X_i'\hat\epsilon_i\right) = 0$$
GMM (cont.)

Take a mean-value expansion:
$$0 = G(\beta_0)'\hat W \bar g(\beta_0) + G(\beta_0)'\hat W G(\bar\beta)\,(\hat\beta - \beta_0)$$

We know
$$\sqrt{n}\, G'W\bar g(\beta_0) = G'W \frac{1}{\sqrt{n}}\sum_{i=1}^n X_i'\epsilon_i \xrightarrow{d} N(0, G'W\Omega WG), \quad \Omega = E(X_i'X_i\epsilon_i^2)$$

Hence:
$$\sqrt{n}(\hat\beta - \beta_0) \xrightarrow{d} N\big(0,\; (G'WG)^{-1}G'W\Omega WG\,(G'WG)^{-1}\big)$$
Optimal Weighting

Recall:
$$\sqrt{n}(\hat\beta - \beta_0) \xrightarrow{d} N\big(0,\; (G'WG)^{-1}G'W\Omega WG\,(G'WG)^{-1}\big)$$

Choose $W = \Omega^{-1}$. Then things simplify a lot:
$$\sqrt{n}(\hat\beta - \beta_0) \xrightarrow{d} N\big(0,\; (G'\Omega^{-1}G)^{-1}\big)$$

Recall:
$$\hat G = \frac{1}{n}\sum_{i=1}^n X_i'X_i$$
and, with White's estimator,
$$\hat\Omega = \frac{1}{n}\sum_{i=1}^n X_i'X_i\hat\epsilon_i^2$$
Optimal Weighting (cont.)

Or, if homoskedasticity holds, then:
$$\hat\Omega = \frac{\hat\sigma^2}{n}\sum_{i=1}^n X_i'X_i$$
and hence we recover the usual OLS result:
$$\sqrt{n}(\hat\beta - \beta_0) \xrightarrow{d} N\!\left(0,\; \hat\sigma^2\left(\frac{1}{n}\sum_{i=1}^n X_i'X_i\right)^{-1}\right)$$

GLS consists in choosing the weighting matrix optimally when we cannot make these simplifying assumptions.
Remarks

In general settings, the optimal weighting matrix involves parameter estimates. No problem: since non-optimal weights still produce consistent parameter estimates, we can proceed in stages.

Under homoskedasticity, however, this is not an issue, since the optimal weights do not involve parameter estimates and hence we recover the usual OLS formulas.

Notice that in the more general case, GMM returns a GLS procedure where observations are weighted by the inverse of their variances. Why does this make sense? Intuition?

Be careful keeping track of population versus sample quantities in the derivations.
A Preliminary Example

Suppose you want to simulate an n-dimensional, normally distributed random vector with mean $\mu$ and variance $\Sigma$ using a random number generator that produces n-dimensional $N(0, I)$ random variates. How would you do it?

Any symmetric, positive-definite matrix $A$ with real entries can be decomposed as
$$A = LL'$$
where $L$ is a lower triangular matrix with strictly positive diagonal entries. This decomposition is unique, although there are other square-root matrices if lower-triangularity is not required of $L$.
Example (cont.)

Let $z \sim N(0, I)$ and suppose we want to generate
$$y \sim N\!\left(\begin{pmatrix}1\\2\end{pmatrix}, \begin{pmatrix}1 & 1\\ 1 & 2\end{pmatrix}\right)$$

First notice that
$$\begin{pmatrix}1 & 1\\ 1 & 2\end{pmatrix} = \begin{pmatrix}1 & 0\\ 1 & 1\end{pmatrix}\begin{pmatrix}1 & 1\\ 0 & 1\end{pmatrix}$$

Hence
$$y = \begin{pmatrix}1 & 0\\ 1 & 1\end{pmatrix} z + \begin{pmatrix}1\\2\end{pmatrix} \sim N\!\left(\begin{pmatrix}1\\2\end{pmatrix}, \begin{pmatrix}1 & 1\\ 1 & 2\end{pmatrix}\right)$$
Sample GAUSS Program

z = rndn(100000,2);   // generate obs of 2 N(0,1) rvs
mean = 1~2;           // desired mean
var = (1|1)~(1|2);    // desired variance matrix
L = chol(var);        // Cholesky decomposition of var
// Notice GAUSS produces an UPPER triangular matrix
// Generate
y = z*L + mean;
"-----------------------------------------";
" Verify the mean of y is correct";
"-----------------------------------------";
"Original Mean"$~"Sample Mean";
mean'~meanc(y);
"-----------------------------------------";
" Verify the Co-variance Matrix is correct";
"-----------------------------------------";
"Desired Co-variance matrix";
var;
"Sample Co-variance matrix";
vcx(y);
Output» run C:\Docs\teaching\240A\STATA\corrnormal.g; ----------------------------------------- Verify the mean of y is correct ----------------------------------------- Original Mean Sample Mean 1.0000000 1.0031228 2.0000000 2.0073557 ----------------------------------------- Verify the Co-variance Matrix is correct ----------------------------------------- Desired Co-variance matrix 1.0000000 1.0000000 1.0000000 2.0000000 Sample Co-variance matrix 1.0015947 1.0015737 1.0015737 2.0034021 24
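For readers without GAUSS, the same exercise can be sketched in numpy. One detail to watch: numpy's `cholesky` returns the lower-triangular factor $L$ with $\Sigma = LL'$, whereas GAUSS's `chol` returns the upper-triangular one, so here the transformation is $y = zL' + \mu$ row by row:

```python
import numpy as np

rng = np.random.default_rng(42)
z = rng.standard_normal((100000, 2))       # draws from N(0, I)
mean = np.array([1.0, 2.0])
var = np.array([[1.0, 1.0],
                [1.0, 2.0]])               # desired covariance matrix

L = np.linalg.cholesky(var)                # lower triangular: var = L @ L.T
y = z @ L.T + mean                         # each row of y ~ N(mean, var)

print(y.mean(axis=0))                      # sample mean, close to [1, 2]
print(np.cov(y, rowvar=False))             # sample covariance, close to var
```

The printed sample moments should match the targets up to simulation error, mirroring the GAUSS output on the previous slide.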
Why Does This Matter?

Suppose you could calculate the $n \times n$ variance-covariance matrix $\Omega$ of the residuals in a linear regression model. Because it is symmetric and positive definite, it has an inverse that is also symmetric and positive definite and hence admits a Cholesky decomposition, for example:
$$\Omega^{-1} = PP'$$

Pre-multiplying the regression by $P'$ on both sides:
$$P'y = P'X\beta + P'\epsilon$$
The GLS Estimator

Now, using the OLS formulas on the modified regression $P'y = P'X\beta + P'\epsilon$, we have:
$$\hat\beta_{GLS} = (X'PP'X)^{-1}(X'PP'y) = (X'\Omega^{-1}X)^{-1}(X'\Omega^{-1}y)$$

Notice then that, with $u = P'\epsilon$,
$$E(uu') = E(P'\epsilon\epsilon'P) = P'E(\epsilon\epsilon')P = P'\Omega P = P'(PP')^{-1}P = I$$

The residuals $u$ are homoskedastic!
The GLS Estimator (cont.)

The variance of the GLS estimator is, using the usual formula (you should be able to show this):
$$V(\hat\beta_{GLS}) = (X'\Omega^{-1}X)^{-1}$$

In fact, the GLS estimator should not be a surprise. We had already derived it using GMM with the optimal weighting matrix $\hat W = \Omega^{-1}$.
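The algebraic equivalence between the transformed-OLS route and the direct $(X'\Omega^{-1}X)^{-1}X'\Omega^{-1}y$ formula is easy to verify numerically. A sketch with a known diagonal $\Omega$ (the data and variances are simulated purely for illustration):

```python
import numpy as np

rng = np.random.default_rng(1)
n = 200
X = np.column_stack([np.ones(n), rng.normal(size=n)])
sigma2 = rng.uniform(0.5, 3.0, size=n)         # known heteroskedastic variances
y = X @ np.array([1.0, 2.0]) + rng.normal(size=n) * np.sqrt(sigma2)

# direct GLS formula
Oinv = np.diag(1.0 / sigma2)
beta_gls = np.linalg.solve(X.T @ Oinv @ X, X.T @ Oinv @ y)

# via the Cholesky factor P with Omega^{-1} = P P': run OLS on P'y, P'X
P = np.linalg.cholesky(Oinv)                   # lower triangular, Oinv = P @ P.T
Xs, ys = P.T @ X, P.T @ y
beta_chol = np.linalg.solve(Xs.T @ Xs, Xs.T @ ys)
```

Both routes give the same coefficient vector, since $X'PP'X = (P'X)'(P'X)$ by construction.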
Practical Observations

Notice that I have assumed that the $n \times n$ matrix $\Omega$ is available, but in practice it would have to be estimated. Moreover, we clearly need to impose some structure, since estimating $n(n+1)/2$ distinct terms with $n$ observations is infeasible.

But even if we had all these terms available, we would be inverting a fairly large matrix, which could be computationally problematic.

So next we look at simple structures that we can impose on $\Omega$ to obtain reasonable estimates.
Feasible GLS

Let's think back to the food expenditures example, where residuals across individuals are probably not correlated but do vary with the income of the individual, that is:
$$E(\epsilon_i^2 \mid X_i) = \sigma^2(X_i)$$

For example, we could conjecture
$$E(\epsilon_i^2 \mid X_i) = \exp(X_i\alpha)$$
so as to ensure positivity. Using method of moments arguments, and using the residuals from a first stage...
Feasible GLS (cont.)

...we could consider fitting the auxiliary regression:
$$\log \hat\epsilon_i^2 = X_i\alpha + v_i$$

The $v_i$ are funny-looking residuals, but centered at zero, which is what matters to obtain consistent estimates of the $\alpha$. These could be used to construct the weights
$$\hat w_i = (\exp(X_i\hat\alpha))^{-1}$$
and hence the weighting matrix
$$\hat W = \begin{pmatrix} \hat w_1 & 0 & \cdots & 0\\ 0 & \hat w_2 & \cdots & 0\\ \vdots & \vdots & \ddots & \vdots\\ 0 & 0 & \cdots & \hat w_n \end{pmatrix}, \qquad \hat\beta_{FGLS} = (X'\hat W X)^{-1}(X'\hat W y)$$
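The three steps above can be sketched in Python. This is an illustrative simulation, not a recipe for real data: the skedastic specification $\exp(X_i\alpha)$ is imposed on the simulated errors so that the auxiliary regression is correctly specified:

```python
import numpy as np

rng = np.random.default_rng(2)
n = 2000
x = rng.normal(size=n)
X = np.column_stack([np.ones(n), x])
alpha = np.array([0.0, 1.0])                         # true skedastic coefficients
eps = rng.normal(size=n) * np.exp(0.5 * X @ alpha)   # Var(eps | X) = exp(X alpha)
y = X @ np.array([1.0, 2.0]) + eps

# stage 1: OLS residuals
b_ols = np.linalg.solve(X.T @ X, X.T @ y)
e = y - X @ b_ols

# stage 2: skedastic regression  log e_i^2 = X_i alpha + v_i
a_hat = np.linalg.solve(X.T @ X, X.T @ np.log(e ** 2))

# stage 3: weighted (FGLS) estimate with w_i = exp(X_i a_hat)^{-1}
w = np.exp(-X @ a_hat)
XW = X * w[:, None]
b_fgls = np.linalg.solve(XW.T @ X, XW.T @ y)
```

Note the log-squared-residual regression estimates the intercept of $\alpha$ with a known bias, but that only rescales all weights by a constant, which leaves the FGLS point estimate unchanged.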
Remarks

Reasons FGLS may not work in practice: the auxiliary regression, sometimes called the skedastic regression, may be misspecified. The OLS estimator is more robust, as it can be justified on projection arguments; the FGLS estimator relies on the correct specification of the conditional mean.

The skedastic regression lends itself to testing for heteroskedasticity, as we could use a typical F-test to assess the null $H_0: \alpha_2 = \cdots = \alpha_K = 0$ (i.e., excluding the constant).
Recap

So far I have focused on situations where I essentially rule out covariation across units; i.e., heteroskedasticity is all about the terms on the diagonal of the covariance matrix not being equal. Hence there are $n$ different terms, so with a sample of size $n$, in principle we cannot consistently estimate any of them. But with FGLS, if we specify a conditional variance function, we can reduce dimensionality considerably.

Think of White's (1980) correction as a nonparametric correction for heteroskedasticity: the correction does not require consistent estimates of the individual variances but, by the same token, it is not as efficient as when one can come up with a valid FGLS specification.
Serial Correlation

There is a natural environment where the residuals may be correlated across units, and that is time series. So far we have assumed i.i.d. samples; time series data clearly violate the independence assumption: GDP today depends on GDP yesterday.

We will investigate time series in more detail later. Here we focus on serial correlation in the residuals; i.e., if the conditional mean were well specified, the residuals would be white noise (uncorrelated over time). However, some correlation in the residuals may still be compatible with consistent estimates of the parameters of interest; we just have to fix the standard errors!
Dependency

What is the magic of the i.i.d. sample? Recall, a random sample $\{z_1, \ldots, z_n\}$ has a joint density $f(z_1, \ldots, z_n; \theta)$. Under i.i.d. sampling, this density can be factored as:
$$f(z_1, \ldots, z_n; \theta) = f(z_1; \theta) \cdots f(z_n; \theta)$$

However, if the data are dependent:
$$f(z_1, \ldots, z_t, \ldots, z_n; \theta) = f(z_1; \theta)\, f(z_2 \mid z_1; \theta) \cdots f(z_t \mid z_{t-1}, \ldots, z_1; \theta) \cdots f(z_n \mid z_{n-1}, \ldots, z_1; \theta)$$

So we need a way to limit dependence on the past.
AR(1)

One way to limit dependence is with the autoregressive model of order 1: the AR(1). Let's focus on the residuals of an OLS regression; an AR(1) structure looks like this:
$$u_t = \rho u_{t-1} + \epsilon_t, \qquad \epsilon_t \sim D(0, \sigma_\epsilon^2), \qquad |\rho| < 1$$

Remarks: notice that by recursive substitution
$$u_t = \epsilon_t + \rho\epsilon_{t-1} + \rho^2\epsilon_{t-2} + \cdots$$
hence $|\rho| < 1$ ensures dependence on the past dies off quickly.
Remarks (cont.)

Notice that:
$$\sigma_u^2 = \sigma_\epsilon^2 + \rho^2\sigma_\epsilon^2 + \rho^4\sigma_\epsilon^2 + \cdots = \frac{\sigma_\epsilon^2}{1 - \rho^2}, \qquad |\rho| < 1$$
and
$$\operatorname{Cov}(u_t, u_{t-j}) = E(u_t u_{t-j}) = \rho^j \sigma_u^2$$

Under the assumption $|\rho| < 1$, these infinite sums converge and are independent of $t$. When the mean, variance, and covariances are independent of $t$, such a process is known as covariance-stationary.
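These stationarity formulas are easy to check by simulation. A quick sketch (the values of $\rho$, $\sigma_\epsilon$, and the sample length are arbitrary choices):

```python
import numpy as np

rng = np.random.default_rng(3)
rho, sigma_eps, T = 0.7, 1.0, 100000
eps = rng.normal(scale=sigma_eps, size=T)

u = np.empty(T)
u[0] = eps[0] / np.sqrt(1 - rho ** 2)      # draw u_0 from the stationary distribution
for t in range(1, T):
    u[t] = rho * u[t - 1] + eps[t]         # u_t = rho u_{t-1} + eps_t

var_theory = sigma_eps ** 2 / (1 - rho ** 2)   # sigma_u^2 = sigma_eps^2 / (1 - rho^2)
cov1 = np.cov(u[1:], u[:-1])[0, 1]             # sample Cov(u_t, u_{t-1})

print(u.var(), var_theory)                 # sample variance vs. theory
print(cov1, rho * var_theory)              # first autocovariance: rho * sigma_u^2
```

The sample variance and first-order autocovariance should match $\sigma_\epsilon^2/(1-\rho^2)$ and $\rho\,\sigma_u^2$ up to simulation error.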
Covariance Residual Matrix

Using the formulas for the variance and the covariances, we can show that
$$\Omega_u = \frac{\sigma_\epsilon^2}{1 - \rho^2}\begin{pmatrix} 1 & \rho & \rho^2 & \cdots & \rho^{n-1}\\ \rho & 1 & \rho & \cdots & \rho^{n-2}\\ \vdots & \vdots & \vdots & \ddots & \vdots\\ \rho^{n-1} & \rho^{n-2} & \rho^{n-3} & \cdots & 1 \end{pmatrix}$$

So although $\Omega_u$ has in principle $n(n-1)/2$ distinct off-diagonal elements, they are all determined by two parameters, $\sigma_\epsilon^2$ and $\rho$.

Let's see what the FGLS estimator looks like in this case.
FGLS with AR(1) Residuals

1. Estimate the regression model as usual. Obtain consistent estimates $\hat\beta$ and, more importantly, the residuals $\hat u_t$.
2. Estimate the following regression by OLS, for example:
$$\hat u_t = \rho \hat u_{t-1} + \epsilon_t$$
3. Given $\hat\rho$ and $\hat\sigma_\epsilon^2$, construct
$$\hat W^{-1} = \hat\Omega_u = \frac{\hat\sigma_\epsilon^2}{1 - \hat\rho^2}\begin{pmatrix} 1 & \hat\rho & \hat\rho^2 & \cdots & \hat\rho^{n-1}\\ \hat\rho & 1 & \hat\rho & \cdots & \hat\rho^{n-2}\\ \vdots & \vdots & \vdots & \ddots & \vdots\\ \hat\rho^{n-1} & \hat\rho^{n-2} & \hat\rho^{n-3} & \cdots & 1 \end{pmatrix}$$
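The matrix in step 3 is a Toeplitz matrix that can be assembled directly from $\hat\rho$ and $\hat\sigma_\epsilon^2$. A small numpy sketch (the numerical values of $\hat\rho$ and $\hat\sigma_\epsilon^2$ are hypothetical placeholders, as if produced by steps 1-2):

```python
import numpy as np

def ar1_omega(rho, sigma_eps2, n):
    """Residual covariance matrix implied by AR(1) errors:
    Omega[i, j] = sigma_eps^2 / (1 - rho^2) * rho**|i - j|."""
    idx = np.arange(n)
    return sigma_eps2 / (1 - rho ** 2) * rho ** np.abs(idx[:, None] - idx[None, :])

# hypothetical estimates from steps 1-2, small n for display
rho_hat, s2_hat, n = 0.6, 1.3, 5
Omega_hat = ar1_omega(rho_hat, s2_hat, n)

# FGLS then uses W = Omega_hat^{-1}:
#   beta_fgls = (X' W X)^{-1} X' W y
```

Each off-diagonal entry decays geometrically with the distance between observations, exactly as in the displayed matrix.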
A More Direct Approach

However, notice that we can correct serial correlation in one fell swoop. Notice:
$$u_t = y_t - X_t\beta, \qquad u_{t-1} = y_{t-1} - X_{t-1}\beta$$
hence $u_t = \rho u_{t-1} + \epsilon_t$ becomes:
$$y_t = \rho y_{t-1} + X_t\beta - \rho X_{t-1}\beta + \epsilon_t$$

The model can be estimated by nonlinear methods (e.g., NLS, GMM), and a test of the null $H_0: \rho = 0$ serves as a test of serial correlation.
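One simple way to estimate the quasi-differenced model is to concentrate $\beta$ out and minimize the sum of squared residuals over a grid of $\rho$ values; a gradient-based NLS or GMM routine would be the practical choice, but the grid version makes the logic transparent. A sketch with simulated data:

```python
import numpy as np

rng = np.random.default_rng(4)
T = 1000
x = rng.normal(size=T)
X = np.column_stack([np.ones(T), x])
beta0, rho0 = np.array([1.0, 2.0]), 0.5

eps = rng.normal(size=T)
u = np.empty(T)
u[0] = eps[0]
for t in range(1, T):
    u[t] = rho0 * u[t - 1] + eps[t]        # AR(1) regression errors
y = X @ beta0 + u

def ssr(rho):
    """OLS on the quasi-differenced data for a given rho; returns (SSR, beta)."""
    ys = y[1:] - rho * y[:-1]
    Xs = X[1:] - rho * X[:-1]
    b = np.linalg.solve(Xs.T @ Xs, Xs.T @ ys)
    r = ys - Xs @ b
    return r @ r, b

grid = np.linspace(-0.95, 0.95, 381)
rho_hat = grid[int(np.argmin([ssr(r)[0] for r in grid]))]
b_hat = ssr(rho_hat)[1]
```

The concentrated objective is minimized near the true $\rho$, and the slope coefficient is recovered from the quasi-differenced regression at that $\rho$.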
Higher Order Serial Correlation

A more general model would allow for more serial dependence, such as
$$u_t = \rho_1 u_{t-1} + \rho_2 u_{t-2} + \cdots + \rho_p u_{t-p} + \epsilon_t$$
from which we could proceed as we did with the AR(1) model.

Is there something similar to White's (1980) correction for heteroskedasticity, but for serial correlation? Yes: HAC covariance matrix estimators. The most common one is the Newey-West estimator, but it is beyond the scope of this class.
A STATA Example

Suppose I want to estimate Okun's Law, which says:
$$\frac{\Delta Y}{Y} = c - k\,\Delta U$$
where $Y$ is real GDP and $U$ is the unemployment rate. I prefer to look at year-on-year changes to smooth out the data a bit.

The file okun.do contains a brief STATA script to estimate this regression and adjust the standard errors with Newey-West.
Basic Commands in Okun.do log using okun.log, replace insheet using "C:\Docs\teaching\240A\STATA\okun.csv", comma gen qtr = q(1948q1) + _n-1 tsset qtr,q gen dy = 100*(gdp - l4.gdp)/l4.gdp gen du = u - l4.u reg dy du // OLS reg dy du, vce(robust) // OLS with White SE newey dy du, lag(4) log close exit 42
Output

. reg dy du    // OLS

      Source |       SS       df       MS              Number of obs =     243
-------------+------------------------------           F(  1,   241) =  812.45
       Model |  1399.08645     1  1399.08645           Prob > F      =  0.0000
    Residual |  415.017068   241  1.72206252           R-squared     =  0.7712
-------------+------------------------------           Adj R-squared =  0.7703
       Total |  1814.10352   242  7.49629552           Root MSE      =  1.3123

------------------------------------------------------------------------------
          dy |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
          du |  -1.970018    .069115   -28.50   0.000    -2.106165   -1.833871
       _cons |   3.450868   .0843668    40.90   0.000     3.284678    3.617059
------------------------------------------------------------------------------

. reg dy du, vce(robust)    // OLS with White corrected SE

Linear regression                                      Number of obs =     243
                                                       F(  1,   241) =  731.51
                                                       Prob > F      =  0.0000
                                                       R-squared     =  0.7712
                                                       Root MSE      =  1.3123

------------------------------------------------------------------------------
             |               Robust
          dy |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
          du |  -1.970018   .0728382   -27.05   0.000    -2.113499   -1.826537
       _cons |   3.450868    .085788    40.23   0.000     3.281878    3.619858
------------------------------------------------------------------------------

. newey dy du, lag(4)

Regression with Newey-West standard errors             Number of obs =     243
maximum lag: 4                                         F(  1,   241) =  362.70
                                                       Prob > F      =  0.0000

------------------------------------------------------------------------------
             |             Newey-West
          dy |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
          du |  -1.970018   .1034413   -19.04   0.000    -2.173782   -1.766254
       _cons |   3.450868   .1457643    23.67   0.000     3.163733    3.738003
------------------------------------------------------------------------------