Single Equation Linear GMM with Serially Correlated Moment Conditions

Eric Zivot

November 2, 2011

Univariate Time Series

Let $\{y_t\}$ be an ergodic-stationary time series with $E[y_t] = \mu$ and $\mathrm{var}(y_t) < \infty$. A fundamental decomposition result is the following:

Wold Representation Theorem. $y_t$ has the representation
$$y_t = \mu + \sum_{j=0}^{\infty}\psi_j\varepsilon_{t-j} = \mu + \varepsilon_t + \psi_1\varepsilon_{t-1} + \cdots$$
where $\psi_0 = 1$, $\sum_{j=0}^{\infty}\psi_j^2 < \infty$, and $\varepsilon_t \sim \mathrm{MDS}(0,\sigma^2)$.

Remarks

1. The Wold representation shows that $y_t$ has a linear structure. As a result, the Wold representation is often called the linear process representation of $y_t$.

2. $\sum_{j=0}^{\infty}\psi_j^2 < \infty$ is called square-summability and controls the memory of the process. It implies that $\psi_j \to 0$ as $j \to \infty$ at a sufficiently fast rate.

Variance
$$\gamma_0 = \mathrm{var}(y_t) = \mathrm{var}\left(\sum_{k=0}^{\infty}\psi_k\varepsilon_{t-k}\right) = \sum_{k=0}^{\infty}\psi_k^2\,\mathrm{var}(\varepsilon_t) = \sigma^2\sum_{k=0}^{\infty}\psi_k^2 < \infty$$

Autocovariances
$$\gamma_j = E[(y_t - \mu)(y_{t-j} - \mu)] = E\left[\left(\sum_{k=0}^{\infty}\psi_k\varepsilon_{t-k}\right)\left(\sum_{h=0}^{\infty}\psi_h\varepsilon_{t-h-j}\right)\right]$$
$$= E[(\psi_0\varepsilon_t + \psi_1\varepsilon_{t-1} + \cdots + \psi_j\varepsilon_{t-j} + \cdots)(\psi_0\varepsilon_{t-j} + \psi_1\varepsilon_{t-j-1} + \cdots)] = \sigma^2\sum_{k=0}^{\infty}\psi_{j+k}\psi_k, \quad j = 0, 1, 2, \ldots$$

Ergodicity

Ergodicity requires
$$\sum_{j=0}^{\infty}|\gamma_j| < \infty$$
It can be shown that
$$\sum_{j=0}^{\infty}|\gamma_j| < \infty \ \text{ implies } \ \sum_{j=0}^{\infty}\psi_j^2 < \infty$$

Asymptotic Properties of Linear Processes

LLN for Linear Processes (Phillips and Solo, Annals of Statistics, 1992). Assume
$$y_t = \mu + \psi(L)\varepsilon_t = \mu + \sum_{j=0}^{\infty}\psi_j\varepsilon_{t-j}, \quad \varepsilon_t \sim \mathrm{MDS}(0,\sigma^2), \quad \psi(L) = \sum_{j=0}^{\infty}\psi_j L^j$$
and that $\psi(L)$ is 1-summable; i.e.,
$$\sum_{j=0}^{\infty} j|\psi_j| = 1|\psi_1| + 2|\psi_2| + \cdots < \infty$$

Then
$$\hat{\mu} = \frac{1}{T}\sum_{t=1}^{T} y_t \xrightarrow{p} E[y_t] = \mu$$
$$\hat{\gamma}_j = \frac{1}{T}\sum_{t=j+1}^{T}(y_t - \hat{\mu})(y_{t-j} - \hat{\mu}) \xrightarrow{p} \mathrm{cov}(y_t, y_{t-j}) = \gamma_j$$
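As an illustration of this LLN, here is a small simulation sketch comparing the sample moments with their population counterparts for an AR(1) process (parameter values are arbitrary; only numpy is assumed):

```python
import numpy as np

# Sketch: sample mean and sample autocovariances converging to their
# population counterparts for an AR(1) with mu = 2, phi = 0.5, sigma = 1.
rng = np.random.default_rng(1)
mu, phi, sigma, T = 2.0, 0.5, 1.0, 100_000

y = np.empty(T)
y[0] = mu
for t in range(1, T):
    y[t] = mu + phi * (y[t - 1] - mu) + rng.normal(0.0, sigma)

mu_hat = y.mean()
gamma0_hat = np.mean((y - mu_hat) ** 2)
gamma1_hat = np.mean((y[1:] - mu_hat) * (y[:-1] - mu_hat))

print(mu_hat, "vs", mu)                                   # mu_hat -> mu
print(gamma0_hat, "vs", sigma**2 / (1 - phi**2))          # gamma_0
print(gamma1_hat, "vs", phi * sigma**2 / (1 - phi**2))    # gamma_1
```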

CLT for Linear Processes (Phillips and Solo, Annals of Statistics, 1992). Assume
$$y_t = \mu + \psi(L)\varepsilon_t = \mu + \sum_{j=0}^{\infty}\psi_j\varepsilon_{t-j}, \quad \varepsilon_t \sim \mathrm{MDS}(0,\sigma^2), \quad \psi(L) = \sum_{j=0}^{\infty}\psi_j L^j$$
where $\psi(L)$ is 1-summable and
$$\psi(1) = \sum_{j=0}^{\infty}\psi_j \neq 0$$

Then
$$\sqrt{T}(\hat{\mu} - \mu) \xrightarrow{d} N(0, \mathrm{LRV})$$
$$\mathrm{LRV} = \text{long-run variance} = \sum_{j=-\infty}^{\infty}\gamma_j = \gamma_0 + 2\sum_{j=1}^{\infty}\gamma_j, \ \text{since } \gamma_{-j} = \gamma_j$$
$$= \sigma^2\psi(1)^2 = \sigma^2\left(\sum_{j=0}^{\infty}\psi_j\right)^2$$

Intuition behind the LRV formula

Consider
$$\mathrm{var}(\sqrt{T}\,\bar{y}) = \mathrm{var}\left(\frac{1}{\sqrt{T}}\sum_{t=1}^{T} y_t\right) = \frac{1}{T}\,\mathrm{var}\left(\sum_{t=1}^{T} y_t\right)$$
Using the fact that
$$\sum_{t=1}^{T} y_t = \mathbf{1}'\mathbf{y}, \quad \mathbf{1} = (1,\ldots,1)', \quad \mathbf{y} = (y_1,\ldots,y_T)'$$
it follows that
$$\mathrm{var}\left(\sum_{t=1}^{T} y_t\right) = \mathrm{var}(\mathbf{1}'\mathbf{y}) = \mathbf{1}'\,\mathrm{var}(\mathbf{y})\,\mathbf{1}$$

Now
$$\mathrm{var}(\mathbf{y}) = E[(\mathbf{y} - \mu\mathbf{1})(\mathbf{y} - \mu\mathbf{1})'] = \Gamma =
\begin{pmatrix}
\gamma_0 & \gamma_1 & \gamma_2 & \cdots & \gamma_{T-1} \\
\gamma_1 & \gamma_0 & \gamma_1 & \cdots & \gamma_{T-2} \\
\gamma_2 & \gamma_1 & \gamma_0 & \cdots & \gamma_{T-3} \\
\vdots & \vdots & \vdots & \ddots & \vdots \\
\gamma_{T-1} & \gamma_{T-2} & \gamma_{T-3} & \cdots & \gamma_0
\end{pmatrix}$$
where $\gamma_j = \mathrm{cov}(y_t, y_{t-j})$ and $\gamma_{-j} = \gamma_j$. Therefore,
$$\mathrm{var}\left(\sum_{t=1}^{T} y_t\right) = \mathbf{1}'\,\mathrm{var}(\mathbf{y})\,\mathbf{1} = \mathbf{1}'\Gamma\mathbf{1}$$

Now, $\mathbf{1}'\Gamma\mathbf{1}$ is the sum of all elements in the $T \times T$ matrix $\Gamma$. This sum may be computed by summing across the rows, or the columns, or along the diagonals. Given the band-diagonal structure of $\Gamma$, it is most convenient to sum along the diagonals. Doing so gives
$$\mathbf{1}'\Gamma\mathbf{1} = T\gamma_0 + 2(T-1)\gamma_1 + 2(T-2)\gamma_2 + \cdots + 2\gamma_{T-1}$$

Then
$$\frac{1}{T}\mathbf{1}'\Gamma\mathbf{1} = \gamma_0 + 2\frac{(T-1)}{T}\gamma_1 + 2\frac{(T-2)}{T}\gamma_2 + \cdots + 2\frac{1}{T}\gamma_{T-1} = \gamma_0 + 2\sum_{j=1}^{T-1}\left(1 - \frac{j}{T}\right)\gamma_j$$
As $T \to \infty$, it can be shown that
$$\frac{1}{T}\mathbf{1}'\Gamma\mathbf{1} \to \gamma_0 + 2\sum_{j=1}^{\infty}\gamma_j = \mathrm{LRV}$$

Remark

Since $\gamma_{-j} = \gamma_j$, we may re-express $T^{-1}\mathbf{1}'\Gamma\mathbf{1}$ as
$$\frac{1}{T}\mathbf{1}'\Gamma\mathbf{1} = \sum_{j=-(T-1)}^{T-1}\left(1 - \frac{|j|}{T}\right)\gamma_j$$
Then, an alternative representation for LRV is
$$\mathrm{LRV} = \sum_{j=-\infty}^{\infty}\gamma_j$$

Example: MA(1) Process
$$Y_t = \mu + \varepsilon_t + \theta\varepsilon_{t-1}, \quad |\theta| < 1, \quad \varepsilon_t \sim \mathrm{iid}\,(0,\sigma^2)$$
Recall,
$$\psi(L) = 1 + \theta L, \quad \gamma_0 = \sigma^2(1 + \theta^2), \quad \gamma_1 = \sigma^2\theta$$
Then
$$\mathrm{LRV} = \gamma_0 + 2\sum_{j=1}^{\infty}\gamma_j = \sigma^2(1 + \theta^2) + 2\sigma^2\theta = \sigma^2(1 + \theta)^2 = \sigma^2\psi(1)^2$$
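A quick way to see this formula at work is a simulation sketch: the variance of $\sqrt{T}\,\bar{y}$ across replications should be close to $\sigma^2(1+\theta)^2$ (parameter values below are illustrative, not from the notes):

```python
import numpy as np

# Minimal simulation check of LRV = sigma^2 (1 + theta)^2 for an MA(1) process.
rng = np.random.default_rng(0)
theta, sigma, T, n_sims = 0.5, 1.0, 500, 2000

means = np.empty(n_sims)
for i in range(n_sims):
    eps = rng.normal(0.0, sigma, T + 1)
    y = eps[1:] + theta * eps[:-1]          # MA(1) with mu = 0
    means[i] = y.mean()

print("var(sqrt(T) * ybar)   ~", T * means.var())   # simulated long-run variance
print("sigma^2 (1 + theta)^2 =", sigma**2 * (1 + theta)**2)
```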

Remarks

1. If $\theta = 0$ then $\mathrm{LRV} = \sigma^2$.

2. If $\theta = -1$ then $\psi(1) = 0$ and $\mathrm{LRV} = \sigma^2\psi(1)^2 = 0$.

(a) This motivates the condition $\psi(1) \neq 0$ in the CLT for stationary and ergodic linear processes.

Example: AR(1) Process
$$Y_t - \mu = \phi(Y_{t-1} - \mu) + \varepsilon_t, \quad \varepsilon_t \sim \mathrm{WN}(0,\sigma^2), \quad |\phi| < 1, \quad E[Y_t] = \mu$$
Recall,
$$\psi(L) = (1 - \phi L)^{-1}, \quad \gamma_0 = \frac{\sigma^2}{1 - \phi^2}, \quad \gamma_j = \phi^j\gamma_0$$
Then
$$\mathrm{LRV} = \sigma^2\psi(1)^2 = \frac{\sigma^2}{(1 - \phi)^2}$$

Straightforward algebra gives
$$\mathrm{LRV} = \gamma_0 + 2\sum_{j=1}^{\infty}\gamma_j = \gamma_0 + 2\gamma_0\sum_{j=1}^{\infty}\phi^j = \frac{\sigma^2}{(1 - \phi)^2}$$

Remark: $\mathrm{LRV} \to \infty$ as $\phi \to 1$.
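A tiny numerical illustration of this remark (with $\sigma = 1$ for simplicity; the $\phi$ values are arbitrary):

```python
import numpy as np

# How the AR(1) long-run variance sigma^2 / (1 - phi)^2 blows up as phi -> 1.
sigma = 1.0
for phi in [0.0, 0.5, 0.9, 0.99, 0.999]:
    lrv = sigma**2 / (1.0 - phi)**2
    print(f"phi = {phi:6.3f}   LRV = {lrv:12.2f}")
```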

Estimating the Long-Run Variance
$$y_t = \mu + \psi(L)\varepsilon_t, \quad \varepsilon_t \sim \mathrm{MDS}(0,\sigma^2)$$
$$\mathrm{LRV} = \sum_{j=-\infty}^{\infty}\gamma_j = \gamma_0 + 2\sum_{j=1}^{\infty}\gamma_j = \sigma^2\psi(1)^2$$
There are two types of estimators: parametric (assume a parametric model for $y_t$) and nonparametric (do not assume a parametric model for $y_t$).

Example: MA(1) Process
$$Y_t = \mu + \varepsilon_t + \theta\varepsilon_{t-1}, \quad |\theta| < 1, \quad \varepsilon_t \sim \mathrm{iid}\,(0,\sigma^2), \quad \mathrm{LRV} = \sigma^2(1 + \theta)^2$$
A parametric LRV estimate is
$$\widehat{\mathrm{LRV}} = \hat{\sigma}^2(1 + \hat{\theta})^2, \quad \hat{\sigma}^2 \xrightarrow{p} \sigma^2, \quad \hat{\theta} \xrightarrow{p} \theta$$

Remarks

1. Parametric estimates require us to specify a model for $y_t$.

2. For the parametric estimate, we only need consistent estimates of $\sigma^2$ and $\theta$ (e.g., GMM or ML estimates) in order for $\widehat{\mathrm{LRV}}$ to be consistent.

3. Parametric estimates are generally more efficient than nonparametric estimates, provided they are based on the correct model.

For the general linear process
$$y_t = \mu + \psi(L)\varepsilon_t, \quad \varepsilon_t \sim \mathrm{MDS}(0,\sigma^2)$$
a parametric estimator is not feasible because $\psi(L) = \sum_{j=0}^{\infty}\psi_j L^j$ has an infinite number of parameters. A natural nonparametric estimator is the truncated sum of sample autocovariances
$$\widehat{\mathrm{LRV}}_q = \hat{\gamma}_0 + 2\sum_{j=1}^{q}\hat{\gamma}_j, \quad q = \text{truncation lag}$$
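A minimal sketch of this truncated-sum estimator (function names are mine; only numpy is assumed):

```python
import numpy as np

def sample_autocov(y, j):
    """Sample autocovariance: gamma_hat_j = (1/T) sum_{t=j+1}^T (y_t - ybar)(y_{t-j} - ybar)."""
    y = np.asarray(y, dtype=float)
    T, ybar = len(y), y.mean()
    return np.sum((y[j:] - ybar) * (y[: T - j] - ybar)) / T

def lrv_truncated(y, q):
    """Truncated-sum LRV estimate: gamma_hat_0 + 2 * sum_{j=1}^q gamma_hat_j."""
    return sample_autocov(y, 0) + 2.0 * sum(sample_autocov(y, j) for j in range(1, q + 1))
```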

Problems

1. How to pick $q$?

2. $q$ must grow with $T$ in order for $\widehat{\mathrm{LRV}}_q \xrightarrow{p} \mathrm{LRV}$.

3. $\widehat{\mathrm{LRV}}_q$ can be negative in finite samples, particularly if $q$ is close to $T$. This is related to the identity (a consequence of Parseval's result)
$$\hat{\gamma}_0 + 2\sum_{j=1}^{T-1}\hat{\gamma}_j = 0$$

Kernel-Based Estimators

These are nonparametric estimators of the form (Hayashi notation)
$$\widehat{\mathrm{LRV}}_{\mathrm{ker}} = \sum_{j=-(T-1)}^{T-1}\kappa\!\left(\frac{j}{q(T)}\right)\hat{\gamma}_j$$
where
$$\kappa(\cdot) = \text{kernel weight function}, \quad q(T) = \text{bandwidth parameter}$$

The kernel estimators are motivated by the result
$$\mathrm{var}(\sqrt{T}\,\bar{y}) = \sum_{j=-(T-1)}^{T-1}\left(1 - \frac{|j|}{T}\right)\gamma_j$$
which suggests the kernel
$$\kappa\!\left(\frac{j}{q(T)}\right) = \left(1 - \frac{|j|}{T}\right), \quad q(T) = T$$
applied to the sample autocovariances $\hat{\gamma}_j$.

Examples of Kernel Weight Functions

Truncated kernel: $q(T) = q < T$,
$$\kappa(x) = \begin{cases} 1 & \text{for } |x| \le 1 \\ 0 & \text{for } |x| > 1 \end{cases}$$
Note
$$\kappa\!\left(\frac{j}{q}\right) = \begin{cases} 1 & \text{for } |j| \le q \\ 0 & \text{for } |j| > q \end{cases}$$

Example: $q = 2$
$$\widehat{\mathrm{LRV}}_{\mathrm{ker}}^{\mathrm{Trunc}} = \sum_{j=-(T-1)}^{T-1}\kappa\!\left(\frac{j}{q(T)}\right)\hat{\gamma}_j = \sum_{j=-2}^{2}\hat{\gamma}_j = \hat{\gamma}_{-2} + \hat{\gamma}_{-1} + \hat{\gamma}_0 + \hat{\gamma}_1 + \hat{\gamma}_2 = \hat{\gamma}_0 + 2(\hat{\gamma}_1 + \hat{\gamma}_2)$$

Remarks:

1. The bandwidth parameter $q(T) = q$ acts as a lag truncation parameter.

2. Here, $\widehat{\mathrm{LRV}}_{\mathrm{ker}}^{\mathrm{Trunc}}$ is not guaranteed to be positive, particularly if $q$ is close to $T$.

Bartlett kernel: $q(T) < T$,
$$\kappa(x) = \begin{cases} 1 - |x| & \text{for } |x| \le 1 \\ 0 & \text{for } |x| > 1 \end{cases}$$
$\widehat{\mathrm{LRV}}_{\mathrm{ker}}$ computed with the Bartlett kernel gives the so-called Newey-West estimator.

Example: $q(T) = 3$
$$\kappa\!\left(\frac{j}{q(T)}\right) = \kappa\!\left(\frac{j}{3}\right) = \begin{cases} 1 - \frac{|j|}{3} & \text{for } |j| \le 2 \\ 0 & \text{for } |j| > 2 \end{cases}$$

Then
$$\widehat{\mathrm{LRV}}_{\mathrm{ker}}^{\mathrm{Bartlett}} = \sum_{j=-(T-1)}^{T-1}\kappa\!\left(\frac{j}{q(T)}\right)\hat{\gamma}_j = \sum_{j=-2}^{2}\left(1 - \frac{|j|}{3}\right)\hat{\gamma}_j$$
$$= \tfrac{1}{3}\hat{\gamma}_{-2} + \tfrac{2}{3}\hat{\gamma}_{-1} + \hat{\gamma}_0 + \tfrac{2}{3}\hat{\gamma}_1 + \tfrac{1}{3}\hat{\gamma}_2 = \hat{\gamma}_0 + 2\left(\tfrac{2}{3}\hat{\gamma}_1 + \tfrac{1}{3}\hat{\gamma}_2\right) = \hat{\gamma}_0 + 2\sum_{j=1}^{2}\left(1 - \frac{j}{3}\right)\hat{\gamma}_j$$
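A sketch of the univariate Newey-West (Bartlett-kernel) estimator written exactly as above; with $q(T)=3$ the weights on $\hat{\gamma}_1$ and $\hat{\gamma}_2$ are $2/3$ and $1/3$ (helper names are mine; only numpy is assumed):

```python
import numpy as np

def sample_autocov(y, j):
    # gamma_hat_j = (1/T) sum_{t=j+1}^T (y_t - ybar)(y_{t-j} - ybar)
    y = np.asarray(y, dtype=float)
    T, ybar = len(y), y.mean()
    return np.sum((y[j:] - ybar) * (y[: T - j] - ybar)) / T

def lrv_newey_west(y, q):
    # gamma_hat_0 + 2 * sum_{j=1}^{q-1} (1 - j/q) * gamma_hat_j  (Bartlett weights)
    lrv = sample_autocov(y, 0)
    for j in range(1, q):
        lrv += 2.0 * (1.0 - j / q) * sample_autocov(y, j)
    return lrv

# Example with q(T) = 3 on an arbitrary simulated series.
y = np.random.default_rng(2).normal(size=200)
print(lrv_newey_west(y, 3))
```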

Remarks

1. Use of the Bartlett kernel guarantees that $\widehat{\mathrm{LRV}}_{\mathrm{ker}}^{\mathrm{Bartlett}} > 0$ (see Newey and West, 1987, Econometrica, for a proof).

2. In order for $\widehat{\mathrm{LRV}}_{\mathrm{ker}}^{\mathrm{Bartlett}} \xrightarrow{p} \mathrm{LRV}$, it must be the case that $q(T) \to \infty$ as $T \to \infty$. There is a bias-variance tradeoff: if $q(T) \to \infty$ too slowly then there will be bias; if $q(T) \to \infty$ too fast then there will be an increase in variance.

3. Andrews (1991, Econometrica) proved that if $q(T) \to \infty$ at rate $T^{1/3}$ then $\mathrm{MSE}(\widehat{\mathrm{LRV}}_{\mathrm{ker}}^{\mathrm{Bartlett}}, \mathrm{LRV})$ will be minimized asymptotically.

4. Andrews' result that $q(T) \to \infty$ at rate $T^{1/3}$ suggests setting $q(T) = cT^{1/3}$, with $c$ some constant.

However, Andrews' result says nothing about what $c$ should be. Hence, Andrews' result is not practically useful.

Remarks continued

5. Many software programs use the following rule due to Newey and West (1987):
$$q(T) = \mathrm{int}\left[4\left(\frac{T}{100}\right)^{1/4}\right]$$

6. Andrews (1991) studied two other kernels, the Parzen kernel and the quadratic spectral (QS) kernel (these are commonly programmed into software). He showed that $q(T) \to \infty$ at rate $T^{1/5}$ in order to asymptotically minimize $\mathrm{MSE}(\widehat{\mathrm{LRV}}_{\mathrm{ker}}, \mathrm{LRV})$. He also showed that $\mathrm{MSE}(\widehat{\mathrm{LRV}}_{\mathrm{ker}}^{\mathrm{QS}}, \mathrm{LRV})$ is the smallest among the different kernels.

Remarks continued

7. Newey and West (1994, Econometrica) gave Monte Carlo evidence that the choice of bandwidth, $q(T)$, is more important than the choice of kernel. They suggested some data-based methods for automatically choosing $q(T)$. However, as discussed in den Haan and Levin (1997), these methods are not really automatic because they depend on an initial estimate of LRV and some pre-specified weights. See Hall (2005), Section 3.5, for more details.

Figure 1: Common kernel functions (Truncated, Bartlett, Parzen, and Quadratic Spectral weights plotted against the lag).
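For reference, a sketch of the four kernel weight functions shown in Figure 1 (the Parzen and QS formulas follow Andrews, 1991; evaluating them at lags 1-10 with bandwidth 5 is just illustrative):

```python
import numpy as np

def k_truncated(x):
    return np.where(np.abs(x) <= 1.0, 1.0, 0.0)

def k_bartlett(x):
    return np.where(np.abs(x) <= 1.0, 1.0 - np.abs(x), 0.0)

def k_parzen(x):
    a = np.abs(x)
    return np.where(a <= 0.5, 1.0 - 6.0 * a**2 + 6.0 * a**3,
                    np.where(a <= 1.0, 2.0 * (1.0 - a)**3, 0.0))

def k_quadratic_spectral(x):
    x = np.asarray(x, dtype=float)
    z = 6.0 * np.pi * x / 5.0
    with np.errstate(divide="ignore", invalid="ignore"):
        k = 25.0 / (12.0 * np.pi**2 * x**2) * (np.sin(z) / z - np.cos(z))
    return np.where(x == 0.0, 1.0, k)          # kappa(0) = 1 by continuity

lags, q = np.arange(1, 11), 5.0
for name, k in [("Truncated", k_truncated), ("Bartlett", k_bartlett),
                ("Parzen", k_parzen), ("QS", k_quadratic_spectral)]:
    print(name, np.round(k(lags / q), 3))
```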

Multivariate Time Series

Consider $n$ time series variables $\{y_{1t}\},\ldots,\{y_{nt}\}$. A multivariate time series is the $(n \times 1)$ vector time series $\{\mathbf{Y}_t\}$ whose $i$th row is $\{y_{it}\}$. That is, for any time $t$, $\mathbf{Y}_t = (y_{1t},\ldots,y_{nt})'$.

Multivariate time series analysis is used when one wants to model and explain the interactions and co-movements among a group of time series variables: consumption and income; stock prices and dividends; forward and spot exchange rates; interest rates, money growth, income, and inflation; GMM moment conditions ($g_t = \mathbf{x}_t\varepsilon_t$).

Stationary and Ergodic Multivariate Time Series

A multivariate time series $\mathbf{Y}_t$ is covariance stationary and ergodic if all of its component time series are stationary and ergodic. Then
$$E[\mathbf{Y}_t] = \mu = (\mu_1,\ldots,\mu_n)'$$
$$\mathrm{var}(\mathbf{Y}_t) = \Gamma_0 = E[(\mathbf{Y}_t - \mu)(\mathbf{Y}_t - \mu)'] =
\begin{pmatrix}
\mathrm{var}(y_{1t}) & \mathrm{cov}(y_{1t},y_{2t}) & \cdots & \mathrm{cov}(y_{1t},y_{nt}) \\
\mathrm{cov}(y_{2t},y_{1t}) & \mathrm{var}(y_{2t}) & \cdots & \mathrm{cov}(y_{2t},y_{nt}) \\
\vdots & \vdots & \ddots & \vdots \\
\mathrm{cov}(y_{nt},y_{1t}) & \mathrm{cov}(y_{nt},y_{2t}) & \cdots & \mathrm{var}(y_{nt})
\end{pmatrix}$$
The correlation matrix of $\mathbf{Y}_t$ is the $(n \times n)$ matrix
$$\mathrm{corr}(\mathbf{Y}_t) = R_0 = D^{-1}\Gamma_0 D^{-1}$$
where $D$ is a diagonal matrix with $j$th diagonal element $(\gamma_{jj}^0)^{1/2} = \mathrm{var}(y_{jt})^{1/2}$.

The parameters $\mu$, $\Gamma_0$ and $R_0$ are estimated from data $(\mathbf{Y}_1,\ldots,\mathbf{Y}_T)$ using the sample moments
$$\bar{\mathbf{Y}} = \frac{1}{T}\sum_{t=1}^{T}\mathbf{Y}_t \xrightarrow{p} E[\mathbf{Y}_t] = \mu$$
$$\hat{\Gamma}_0 = \frac{1}{T}\sum_{t=1}^{T}(\mathbf{Y}_t - \bar{\mathbf{Y}})(\mathbf{Y}_t - \bar{\mathbf{Y}})' \xrightarrow{p} \mathrm{var}(\mathbf{Y}_t) = \Gamma_0$$
$$\hat{R}_0 = \hat{D}^{-1}\hat{\Gamma}_0\hat{D}^{-1} \xrightarrow{p} \mathrm{corr}(\mathbf{Y}_t) = R_0$$
where $\hat{D}$ is the $(n \times n)$ diagonal matrix with the sample standard deviations of $y_{jt}$ along the diagonal. The Ergodic Theorem justifies convergence of the sample moments to their population counterparts.

All of the lag-$k$ cross covariances and correlations are summarized in the $(n \times n)$ lag-$k$ cross covariance and lag-$k$ cross correlation matrices
$$\Gamma_k = E[(\mathbf{Y}_t - \mu)(\mathbf{Y}_{t-k} - \mu)'] =
\begin{pmatrix}
\mathrm{cov}(y_{1t},y_{1t-k}) & \mathrm{cov}(y_{1t},y_{2t-k}) & \cdots & \mathrm{cov}(y_{1t},y_{nt-k}) \\
\mathrm{cov}(y_{2t},y_{1t-k}) & \mathrm{cov}(y_{2t},y_{2t-k}) & \cdots & \mathrm{cov}(y_{2t},y_{nt-k}) \\
\vdots & \vdots & \ddots & \vdots \\
\mathrm{cov}(y_{nt},y_{1t-k}) & \mathrm{cov}(y_{nt},y_{2t-k}) & \cdots & \mathrm{cov}(y_{nt},y_{nt-k})
\end{pmatrix}$$
$$R_k = D^{-1}\Gamma_k D^{-1}$$
The matrices $\Gamma_k$ and $R_k$ are not symmetric in $k$, but it is easy to show that $\Gamma_{-k} = \Gamma_k'$ and $R_{-k} = R_k'$.

The matrices $\Gamma_k$ and $R_k$ are estimated from data $(\mathbf{Y}_1,\ldots,\mathbf{Y}_T)$ using
$$\hat{\Gamma}_k = \frac{1}{T}\sum_{t=k+1}^{T}(\mathbf{Y}_t - \bar{\mathbf{Y}})(\mathbf{Y}_{t-k} - \bar{\mathbf{Y}})'$$
$$\hat{R}_k = \hat{D}^{-1}\hat{\Gamma}_k\hat{D}^{-1}$$
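A minimal numpy sketch of these estimators for a $T \times n$ data matrix (function names are mine):

```python
import numpy as np

def cross_cov(Y, k):
    """Lag-k sample cross covariance: (1/T) sum_{t=k+1}^T (Y_t - Ybar)(Y_{t-k} - Ybar)'."""
    Y = np.asarray(Y, dtype=float)           # Y is T x n
    T = Y.shape[0]
    Yc = Y - Y.mean(axis=0)
    return Yc[k:].T @ Yc[: T - k] / T

def cross_corr(Y, k):
    """Lag-k sample cross correlation matrix R_hat_k = D_hat^{-1} Gamma_hat_k D_hat^{-1}."""
    Dinv = np.diag(1.0 / np.sqrt(np.diag(cross_cov(Y, 0))))
    return Dinv @ cross_cov(Y, k) @ Dinv
```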

Multivariate Wold Representation

Any $(n \times 1)$ covariance stationary multivariate time series $\mathbf{Y}_t$ has a Wold or linear process representation of the form
$$\mathbf{Y}_t = \mu + \varepsilon_t + \Psi_1\varepsilon_{t-1} + \Psi_2\varepsilon_{t-2} + \cdots = \mu + \sum_{k=0}^{\infty}\Psi_k\varepsilon_{t-k}, \quad \Psi_0 = I_n, \quad \varepsilon_t \sim \mathrm{WN}(0,\Sigma)$$
$\Psi_k$ is an $(n \times n)$ matrix with $(i,j)$th element $\psi_{ij}^k$.

In lag operator notation, the Wold form is
$$\mathbf{Y}_t = \mu + \Psi(L)\varepsilon_t, \quad \Psi(L) = \sum_{k=0}^{\infty}\Psi_k L^k$$
The moments of $\mathbf{Y}_t$ are given by
$$E[\mathbf{Y}_t] = \mu, \quad \mathrm{var}(\mathbf{Y}_t) = \sum_{k=0}^{\infty}\Psi_k\Sigma\Psi_k'$$

Digression on Vector Autoregressive Models (VARs)

The most popular multivariate time series model is the vector autoregressive (VAR) model. The VAR model is a multivariate extension of the univariate autoregressive model. For example, a bivariate VAR(1) model has the form
$$\begin{pmatrix} y_{1t} \\ y_{2t} \end{pmatrix} = \begin{pmatrix} c_1 \\ c_2 \end{pmatrix} + \begin{pmatrix} \pi_{11}^1 & \pi_{12}^1 \\ \pi_{21}^1 & \pi_{22}^1 \end{pmatrix}\begin{pmatrix} y_{1t-1} \\ y_{2t-1} \end{pmatrix} + \begin{pmatrix} \varepsilon_{1t} \\ \varepsilon_{2t} \end{pmatrix}$$
or
$$y_{1t} = c_1 + \pi_{11}^1 y_{1t-1} + \pi_{12}^1 y_{2t-1} + \varepsilon_{1t}$$
$$y_{2t} = c_2 + \pi_{21}^1 y_{1t-1} + \pi_{22}^1 y_{2t-1} + \varepsilon_{2t}$$
where
$$\begin{pmatrix} \varepsilon_{1t} \\ \varepsilon_{2t} \end{pmatrix} \sim \mathrm{iid}\left(\begin{pmatrix} 0 \\ 0 \end{pmatrix}, \begin{pmatrix} \sigma_{11} & \sigma_{12} \\ \sigma_{12} & \sigma_{22} \end{pmatrix}\right)$$

The general VAR($p$) model for $\mathbf{Y}_t = (y_{1t}, y_{2t},\ldots,y_{nt})'$ has the form
$$\mathbf{Y}_t = \mathbf{c} + \Pi_1\mathbf{Y}_{t-1} + \Pi_2\mathbf{Y}_{t-2} + \cdots + \Pi_p\mathbf{Y}_{t-p} + \varepsilon_t, \quad t = 1,\ldots,T, \quad \varepsilon_t \sim \mathrm{iid}\,(0,\Sigma)$$
In lag operator notation, we have
$$\Pi(L)\mathbf{Y}_t = \mathbf{c} + \varepsilon_t, \quad \Pi(L) = I_n - \Pi_1 L - \cdots - \Pi_p L^p$$
Note: For $\{\mathbf{Y}_t\}$ to be covariance stationary and ergodic, there must be restrictions on the elements of the $\Pi_k$ matrices. These are discussed in Econ 584.

If $\det(\Pi(z)) = 0$ has all roots outside the complex unit circle, then $\mathbf{Y}_t$ is covariance stationary and ergodic, and the Wold representation for $\mathbf{Y}_t$ can be found by inverting the VAR polynomial:
$$\mathbf{Y}_t = \Pi(L)^{-1}(\mathbf{c} + \varepsilon_t) = \mu + \Psi(L)\varepsilon_t$$
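For a VAR(1), $\det(I_n - \Pi_1 z) = 0$ has all roots outside the unit circle exactly when all eigenvalues of $\Pi_1$ lie inside it. A small sketch of that check (the coefficient matrix is just an illustrative example):

```python
import numpy as np

# Stationarity check for a bivariate VAR(1): eigenvalues of Pi_1 inside the
# unit circle <=> roots of det(I - Pi_1 z) = 0 outside the unit circle.
Pi1 = np.array([[0.5, 0.1],
                [0.2, 0.4]])        # illustrative coefficient matrix
eigvals = np.linalg.eigvals(Pi1)
print("eigenvalues:", eigvals)
print("covariance stationary:", bool(np.all(np.abs(eigvals) < 1.0)))
```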

Long-Run Variance

Let $\mathbf{Y}_t$ be an $(n \times 1)$ stationary and ergodic multivariate time series with $E[\mathbf{Y}_t] = \mu$. Phillips and Solo's CLT for linear processes states that
$$\sqrt{T}(\bar{\mathbf{Y}} - \mu) \xrightarrow{d} N(0, \mathrm{LRV})$$
$$\mathrm{LRV}_{(n \times n)} = \sum_{j=-\infty}^{\infty}\Gamma_j = \Psi(1)\Sigma\Psi(1)', \quad \Psi(1) = \sum_{k=0}^{\infty}\Psi_k$$
Since $\Gamma_{-j} = \Gamma_j'$, LRV may be alternatively expressed as
$$\mathrm{LRV} = \Gamma_0 + \sum_{j=1}^{\infty}(\Gamma_j + \Gamma_j')$$

Nonparametric Estimate of the Long-Run Variance

A consistent estimate of LRV may be computed using nonparametric methods. A popular estimator is the Newey-West weighted autocovariance estimator based on Bartlett weights
$$\widehat{\mathrm{LRV}} = \hat{\Gamma}_0 + \sum_{j=1}^{q(T)-1}\left(1 - \frac{j}{q(T)}\right)\left(\hat{\Gamma}_j + \hat{\Gamma}_j'\right)$$
$$\hat{\Gamma}_j = \frac{1}{T}\sum_{t=j+1}^{T}(\mathbf{Y}_t - \bar{\mathbf{Y}})(\mathbf{Y}_{t-j} - \bar{\mathbf{Y}})', \quad q(T) = \text{bandwidth}$$
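A sketch of this multivariate Newey-West estimator for a $T \times n$ data matrix (function name is mine; only numpy is assumed):

```python
import numpy as np

def lrv_newey_west_multivariate(Y, q):
    """Newey-West LRV estimate with Bartlett weights for a T x n data matrix Y."""
    Y = np.asarray(Y, dtype=float)
    T = Y.shape[0]
    Yc = Y - Y.mean(axis=0)

    def gamma_hat(j):
        # (1/T) sum_{t=j+1}^T (Y_t - Ybar)(Y_{t-j} - Ybar)'
        return Yc[j:].T @ Yc[: T - j] / T

    lrv = gamma_hat(0)
    for j in range(1, q):
        G = gamma_hat(j)
        lrv += (1.0 - j / q) * (G + G.T)
    return lrv
```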

Single Equation Linear GMM with Serial Correlation
$$y_t = \mathbf{z}_t'\delta_0 + \varepsilon_t, \quad t = 1,\ldots,n$$
$\mathbf{z}_t$ = $L \times 1$ vector of explanatory variables
$\delta_0$ = $L \times 1$ vector of unknown coefficients
$\varepsilon_t$ = random error term
$\mathbf{x}_t$ = $K \times 1$ vector of exogenous instruments

Moment Conditions and Identification for the General Model
$$g_t(\mathbf{w}_t,\delta_0) = \mathbf{x}_t\varepsilon_t = \mathbf{x}_t(y_t - \mathbf{z}_t'\delta_0)$$
$$E[g_t(\mathbf{w}_t,\delta_0)] = E[\mathbf{x}_t\varepsilon_t] = E[\mathbf{x}_t(y_t - \mathbf{z}_t'\delta_0)] = 0$$
$$E[g_t(\mathbf{w}_t,\delta)] \neq 0 \ \text{for } \delta \neq \delta_0$$

Serially Correlated Moments

It is assumed that $\{g_t\} = \{\mathbf{x}_t\varepsilon_t\}$ is a mean zero, 1-summable linear process
$$g_t = \Psi(L)\varepsilon_t, \quad \varepsilon_t \sim \mathrm{MDS}(0,\Sigma)$$
satisfying the Phillips-Solo CLT with long-run variance
$$S = \mathrm{LRV} = \Gamma_0 + \sum_{j=1}^{\infty}(\Gamma_j + \Gamma_j') = \Psi(1)\Sigma\Psi(1)'$$
$$\Gamma_0 = E[g_t g_t'] = E[\mathbf{x}_t\mathbf{x}_t'\varepsilon_t^2], \quad \Gamma_j = E[g_t g_{t-j}'] = E[\mathbf{x}_t\mathbf{x}_{t-j}'\varepsilon_t\varepsilon_{t-j}]$$

Remark

If $\{g_t\}$ is not serially correlated, then $\Gamma_j = 0$ for $j > 0$ and $\mathrm{LRV} = \Gamma_0 = E[g_t g_t']$.

Estimation of LRV

LRV is typically estimated using the Newey-West estimator (kernel estimator with Bartlett kernel)
$$\widehat{\mathrm{LRV}} = \hat{\Gamma}_0 + \sum_{j=1}^{q(T)-1}\left(1 - \frac{j}{q(T)}\right)\left(\hat{\Gamma}_j + \hat{\Gamma}_j'\right)$$
$$\hat{\Gamma}_0 = \frac{1}{T}\sum_{t=1}^{T}\hat{g}_t\hat{g}_t' = \frac{1}{T}\sum_{t=1}^{T}\mathbf{x}_t\mathbf{x}_t'\hat{\varepsilon}_t^2$$
$$\hat{\Gamma}_j = \frac{1}{T}\sum_{t=j+1}^{T}\hat{g}_t\hat{g}_{t-j}' = \frac{1}{T}\sum_{t=j+1}^{T}\mathbf{x}_t\mathbf{x}_{t-j}'\hat{\varepsilon}_t\hat{\varepsilon}_{t-j}$$
$$\hat{\varepsilon}_t = y_t - \mathbf{z}_t'\hat{\delta}, \quad \hat{\delta} \xrightarrow{p} \delta_0$$

Remark

The estimate $\widehat{\mathrm{LRV}}$ is often denoted $\hat{S}_{\mathrm{HAC}}$, where HAC stands for heteroskedasticity and autocorrelation consistent.
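A sketch of $\hat{S}_{\mathrm{HAC}}$ for the GMM moments $g_t = \mathbf{x}_t\varepsilon_t$, using the Newey-West bandwidth rule from Remark 5 above as a default (function and argument names are mine; only numpy is assumed):

```python
import numpy as np

def s_hac(X, eps_hat, q=None):
    """Newey-West estimate of S = LRV of g_t = x_t * eps_t (X is T x K, eps_hat has length T)."""
    X, eps_hat = np.asarray(X, float), np.asarray(eps_hat, float)
    T = X.shape[0]
    if q is None:
        q = int(4 * (T / 100.0) ** 0.25)      # Newey-West rule of thumb from Remark 5
    g = X * eps_hat[:, None]                  # T x K matrix of moment contributions

    def gamma_hat(j):
        # (1/T) sum_{t=j+1}^T g_t g_{t-j}'
        return g[j:].T @ g[: T - j] / T

    S = gamma_hat(0)
    for j in range(1, q):
        G = gamma_hat(j)
        S += (1.0 - j / q) * (G + G.T)
    return S
```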

Efficient GMM
$$\hat{\delta}(\hat{S}_{\mathrm{HAC}}^{-1}) = \arg\min_{\delta}\; J(\hat{S}_{\mathrm{HAC}}^{-1},\delta) = \arg\min_{\delta}\; n\,g_n(\delta)'\hat{S}_{\mathrm{HAC}}^{-1}g_n(\delta)$$
As with GMM with heteroskedastic errors, the following efficient GMM estimators can be computed:

1. 2-step efficient

2. Iterated efficient

3. Continuously updated (CU)
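A sketch of the 2-step efficient estimator for the linear model above. Because the moments are linear in $\delta$, the minimizer has the closed form $\hat{\delta}(W) = (S_{xz}'WS_{xz})^{-1}S_{xz}'WS_{xy}$, so no numerical optimizer is needed. The helper s_hac is the sketch from the previous code block, and all names are mine:

```python
import numpy as np

def gmm_two_step(y, Z, X, q=None):
    """2-step efficient GMM for y_t = z_t' delta + eps_t with instruments x_t.

    y: (T,), Z: (T, L), X: (T, K) with K >= L. Uses the closed-form linear GMM
    solution at each step and the s_hac() sketch above for the HAC weight matrix.
    """
    y, Z, X = (np.asarray(a, float) for a in (y, Z, X))
    T = len(y)
    Sxz, Sxy = X.T @ Z / T, X.T @ y / T

    def delta_hat(W):
        A = Sxz.T @ W @ Sxz
        return np.linalg.solve(A, Sxz.T @ W @ Sxy)

    # Step 1: inefficient weight matrix W = (X'X/T)^{-1} (a 2SLS-type first step).
    d1 = delta_hat(np.linalg.inv(X.T @ X / T))
    # Step 2: efficient weight matrix W = S_hac^{-1} built from first-step residuals.
    S = s_hac(X, y - Z @ d1, q)
    return delta_hat(np.linalg.inv(S))
```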

Note: the CU estimator solves
$$\hat{\delta}(\hat{S}_{\mathrm{HAC,CU}}^{-1}) = \arg\min_{\delta}\; n\,g_n(\delta)'\hat{S}_{\mathrm{HAC}}^{-1}(\delta)g_n(\delta)$$
where
$$\hat{S}_{\mathrm{HAC}}(\delta) = \hat{\Gamma}_0(\delta) + \sum_{j=1}^{q(T)-1}\left(1 - \frac{j}{q(T)}\right)\left(\hat{\Gamma}_j(\delta) + \hat{\Gamma}_j(\delta)'\right)$$
$$\hat{\Gamma}_j(\delta) = \frac{1}{T}\sum_{t=j+1}^{T}\mathbf{x}_t\mathbf{x}_{t-j}'\varepsilon_t(\delta)\varepsilon_{t-j}(\delta), \quad \varepsilon_t(\delta) = y_t - \mathbf{z}_t'\delta$$
This can be quite numerically unstable!