E 4101/5101 Lecture 9: Non-stationarity
Ragnar Nymoen
30 March 2011
Introduction

Main references: Hamilton Ch 15, 16 and 17; Davidson and MacKinnon Ch 14.3 and 14.4. Also read Ch 2.4 and Ch 2.5 in Davidson and MacKinnon if you are not familiar with the Frisch-Waugh-Lovell theorem, which we often make reference to.
Deterministic trend and trend stationarity I

Let {y_t ; t = 1, 2, 3, ..., T} define a time series (as before). y_t follows a pure deterministic trend (DT) if

  y_t = φ_0 + δt + ε_t,  δ ≠ 0  (1)

where ε_t is white-noise and Gaussian. y_t is non-stationary, since

  E(y_t) = φ_0 + δt  (2)

even if the variance does not depend on time:

  Var(y_t) = σ²  (3)
Deterministic trend and trend stationarity II

Assume that we are in period T and want a forecast for y_{T+h}. Assume that φ_0 and δ are known parameters in period T. The forecast is:

  ŷ_{T+h|T} = φ_0 + δ(T + h)

Assuming that the DGP is (1) also in the forecast period, the forecast error becomes:

  y_{T+h} − ŷ_{T+h|T} = ε_{T+h},  with E[(y_{T+h} − ŷ_{T+h|T}) | T] = 0
Deterministic trend and trend stationarity III

and variance:

  Var[(y_{T+h} − ŷ_{T+h|T}) | T] = σ²

The conditional variance is the same as the unconditional variance (in the pure DT model). In the pure DT model, non-stationarity is purged by de-trending. The de-trended variable:

  y_t^s = y_t − δt,  with E(y_t^s) = φ_0 and Var(y_t^s) = σ²

y_t^s is covariance stationary. Since stationarity of y_t^s is obtained by subtracting the linear trend δt from y_t in (1), y_t is called a trend-stationary process.
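The trend-stationarity property is easy to see numerically. A minimal sketch (the parameter values φ_0 = 1, δ = 0.5, σ = 1 are made up for illustration, not taken from the lecture):

```python
import numpy as np

rng = np.random.default_rng(0)
T = 10_000
phi0, delta, sigma = 1.0, 0.5, 1.0   # illustrative values

t = np.arange(1, T + 1)
eps = rng.normal(0.0, sigma, size=T)
y = phi0 + delta * t + eps           # pure DT process, eq. (1)

y_s = y - delta * t                  # de-trended variable y_t^s
print(y_s.mean(), y_s.var())         # approx phi0 and sigma^2
```

After subtracting δt, the sample mean and variance are close to φ_0 and σ², while the raw series has a mean that grows without bound.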
Estimation and inference in the deterministic trend model I

Since the deterministic trend model can be placed within the ARMA class of models, it represents no new problems of estimation. Still, the precise statistical analysis is not trivial, as Ch 16.1 in H shows. Specifically, Hamilton (Ch. 16, p 460) shows that for (1),

  y_t = φ_0 + δt + ε_t,  t = 1, 2, ...

with ε_t i.i.d., Var(ε_t) = σ² and E(ε_t⁴) < ∞, the OLS estimators φ̂_0 and δ̂ satisfy

  ( T^{1/2}(φ̂_0 − φ_0) , T^{3/2}(δ̂ − δ) )′  →_D  N( (0, 0)′ , σ²Q⁻¹ ),  Q = [ 1    1/2
                                                                                1/2  1/3 ]
Estimation and inference in the deterministic trend model II

The speed of convergence of δ̂ is T^{3/2} (sometimes written as O_p(T^{−3/2}), for order in probability), while the usual speed of convergence for stationary variables is T^{1/2}. δ̂ is said to be super-consistent. It implies that

  plim(δ̂) = δ  and  plim(T(δ̂ − δ)) = 0.  (4)

However, the OLS-based Var(δ̂) has the same property. This means that the usual test statistics have asymptotic N and χ² distributions, as in the ARMA case; see H 16.2.
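The T^{3/2} rate can be illustrated by a small Monte Carlo (a sketch; the parameter values, sample sizes and replication count are arbitrary choices). Doubling T should shrink the Monte Carlo standard deviation of δ̂ by roughly 2^{3/2} ≈ 2.83, not the usual 2^{1/2} ≈ 1.41:

```python
import numpy as np

rng = np.random.default_rng(1)
phi0, delta, sigma = 1.0, 0.5, 1.0   # illustrative values
reps = 2_000

def sd_delta_hat(T):
    """Monte Carlo standard deviation of the OLS trend coefficient."""
    t = np.arange(1, T + 1)
    X = np.column_stack([np.ones(T), t])
    ests = []
    for _ in range(reps):
        y = phi0 + delta * t + rng.normal(0, sigma, T)
        beta = np.linalg.lstsq(X, y, rcond=None)[0]
        ests.append(beta[1])
    return np.std(ests)

ratio = sd_delta_hat(100) / sd_delta_hat(200)
print(ratio)   # close to 2**1.5 ≈ 2.83
```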
AR model with trend I

A slightly more general DT model:

  y_t = φ_0 + φ_1 y_{t−1} + δt + ε_t,  |φ_1| < 1,  δ ≠ 0  (5)

Conditional on y_0 = 0, the solution is

  y_t = φ_0 Σ_{j=0}^{t−1} φ_1^j + δt Σ_{j=0}^{t−1} φ_1^j − δ Σ_{j=1}^{t−1} φ_1^j j + Σ_{j=0}^{t−1} φ_1^j ε_{t−j}  (6)

If we define

  y_t^s = y_t − δt Σ_{j=0}^{t−1} φ_1^j
AR model with trend II

we get that also this de-trended variable is covariance stationary:

  E(y_t^s) → φ_0/(1 − φ_1) − δφ_1/(1 − φ_1)²
  Var(y_t^s) → σ²/(1 − φ_1²)

where the result for E(y_t^s) makes use of

  δ Σ_{j=1}^{t−1} φ_1^j j → δφ_1/(1 − φ_1)²  as t → ∞
OLS estimation of models with deterministic trend I

Above, we saw that y_t ~ AR(1) + trend can be transformed to y_t^s ~ AR(1). But how can we estimate (φ_0, φ_1, δ)?

With reference to the Frisch-Waugh-Lovell theorem (Davidson and MacKinnon Ch 2.4), φ̂_1 from OLS on

  y_t = φ_0 + φ_1 y_{t−1} + δt + ε_t  (7)

is identical to the regression coefficient in the regression between the de-trended residuals for y_t and y_{t−1}. Hence it is most practical to obtain OLS estimates for (φ_0, φ_1, δ) in one step: by estimation of (7).
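The FWL equivalence can be verified numerically. A sketch (the parameter values are illustrative): estimate (7) in one step, then detrend y_t and y_{t−1} on (1, t) and regress the residuals on each other; the two φ_1 estimates coincide up to floating-point error.

```python
import numpy as np

rng = np.random.default_rng(2)
T = 500
phi0, phi1, delta, sigma = 1.0, 0.6, 0.05, 1.0   # illustrative values

# simulate AR(1) + trend, eq. (5)
y = np.zeros(T)
for t in range(1, T):
    y[t] = phi0 + phi1 * y[t - 1] + delta * t + rng.normal(0, sigma)

Y, Ylag = y[1:], y[:-1]
n = len(Y)
D = np.column_stack([np.ones(n), np.arange(1, n + 1)])   # deterministic terms (1, t)

# one-step OLS on y_t = phi0 + phi1*y_{t-1} + delta*t + e_t, eq. (7)
X = np.column_stack([D, Ylag])
phi1_onestep = np.linalg.lstsq(X, Y, rcond=None)[0][2]

# FWL: residuals from projecting Y and Ylag on the deterministic terms
M = np.eye(n) - D @ np.linalg.inv(D.T @ D) @ D.T
ry, rx = M @ Y, M @ Ylag
phi1_fwl = (rx @ ry) / (rx @ rx)

print(phi1_onestep, phi1_fwl)   # identical up to floating point
```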
OLS estimation of models with deterministic trend II

This extends to y_t ~ AR(p) + trend and to y_t ~ ARDL(p, p) + trend.

Distribution of OLS estimators: For the pure DT model we saw that δ̂ is super-consistent (converges at rate T^{3/2}). In the AR(p) + trend model the OLS estimators of all individual parameters, for example (φ̂_0, φ̂_1, δ̂), are consistent at the usual rate of convergence (√T).
OLS estimation of models with deterministic trend III

The reason why δ̂ is no longer super-consistent in the AR(1) + trend model is that δ̂ is a linear combination of variables that converge at different rates. In such a situation the slowest convergence rate dominates; here it is √T. See H Ch 16.3, p 463-467.

The practical implication is that the stationary asymptotic distribution theory can be used also for dynamic models that include a DT.

For the AR(p) + trend or ARDL(p, p) + trend, the conditional mean and variance of course depend on time, just as in the model without trend: this adds flexibility to the pure DT model.
Other forms of deterministic non-stationarity I

In one important sense, the model with DT is just a special case of

  y_t = φ_0 + φ_1 y_{t−1} + δD(t) + ε_t

where D(t) is any deterministic (vector) function of time. It might be:

- Seasonal dummies, or
- Dummies for structural breaks (inducing shifts in the intercept and/or φ_1, gradually or as a deterministic shock)

As long as the model with D(t) can be re-expressed as a model with constant unconditional mean (with reference to the FWL theorem), this type of non-stationarity has no consequence for the statistical analysis of the model.
Forecasting with pure DT model

[Figure: Forecasting GNP per capita, log(BNPcap), 2007-2011, by DT and local trend models. Shown: the global trend forecast (same as fitted values), local trend forecasts conditional on 2007, 2008 and 2009, and the outcome for 2010 (preliminary data).]
Back-casting also problematic!

In 1975 Norwegian recruits averaged 179 cm, and the increase was 0.8 mm a year. Back-casting shows that the feared Vikings were no more than 30 cm high! See Aukrust (1977) for this piece of research!
Stochastic (or local) trend I

AR(p):

  y_t = φ_0 + φ(L)y_t + ε_t,  (8)

  φ(L) = φ_1 L + φ_2 L² + ... + φ_p L^p

Re-writing the model in the (now) usual way:

  Δy_t = φ_0 + φ*(L)Δy_t − (1 − φ(1)) y_{t−1} + ε_t,  with 1 − φ(1) = p(1)  (9)

  φ*(L) = φ*_1 L + φ*_2 L² + ... + φ*_{p−1} L^{p−1}  (10)

The parameters φ*_i in (10) are functions of the φ_i's.
Stochastic (or local) trend II

We know from before that y_t is stationary and causal if all roots of

  p(λ) = λ^p − φ_1 λ^{p−1} − ... − φ_p  (11)

have modulus less than one. In the case of λ = 1 (one root is equal to 1),

  p(1) = 1 − φ(1) = 0,  (12)

and (9) becomes

  Δy_t = φ_0 + Σ_{i=1}^{p−1} φ*_i Δy_{t−i} + ε_t.  (13)
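The reparameterization behind (9) and (13) can be checked exactly for an AR(2) (a sketch with arbitrary coefficient values): with p = 2 the mapping is φ*_1 = −φ_2 and p(1) = 1 − φ_1 − φ_2, and the two representations generate the same series.

```python
import numpy as np

rng = np.random.default_rng(3)
T = 400
phi0, phi1, phi2 = 0.5, 0.5, 0.3   # arbitrary stationary AR(2) coefficients

eps = rng.normal(size=T)
y = np.zeros(T)
for t in range(2, T):
    y[t] = phi0 + phi1 * y[t - 1] + phi2 * y[t - 2] + eps[t]

# (9) with p = 2:  Delta y_t = phi0 + phi1_star*Delta y_{t-1} - p(1)*y_{t-1} + eps_t
phi1_star = -phi2                 # phi*_1 as a function of the phi_i's
p1 = 1.0 - (phi1 + phi2)          # p(1) = 1 - phi(1)

t = np.arange(2, T)
lhs = y[t] - y[t - 1]
rhs = phi0 + phi1_star * (y[t - 1] - y[t - 2]) - p1 * y[t - 1] + eps[t]
print(np.max(np.abs(lhs - rhs)))  # zero up to floating point
```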
Stochastic (or local) trend III

Definition: y_t given by (8) is integrated of order 1, y_t ~ I(1), if p(λ) = 0 has one characteristic root equal to 1. The stationary case is often referred to as y_t ~ I(0), integrated of order zero. It follows that if y_t ~ I(1), then Δy_t ~ I(0). An integrated series y_t is called difference stationary.

With reference to our earlier discussion of stationarity and spectral analysis, we see that the definition above is not general: the characteristic polynomial of an AR(p) series can have other unit roots than the real root 1.
Stochastic (or local) trend IV

The real root 1 corresponds to a root at the long-run frequency. It implies that a y_t ~ I(1) series has Granger's typical spectral shape. In the following, we will abstract from unit roots at the seasonal or business cycle frequencies. We do not cover seasonal integration; Hylleberg et al. (1990) is the classic reference.

We will however later consider the analysis of variables that are integrated of order 2: y_t ~ I(2) if Δ²y_t ~ I(0), where Δ² = (1 − L)². In the I(2) case, there must be a unit root in the characteristic polynomial associated with (13):

  p*(λ) = λ^{p−1} − φ*_1 λ^{p−2} − ... − φ*_{p−1}.
Contrasting I(0) and I(1) I

                              I(1)                            I(0)
1   Var[y_t]                  → ∞                             finite
2   Corr[y_t, y_{t−p}]        → 1                             → 0
3   Multipliers               do not die out                  → 0
4a  Forecasting y_{T+h}       E(y_{T+h}|T) depends on y_T ∀h  → E(y_t) as h → ∞
4b  Forecasting y_{T+h}       Var of forecast errors → ∞      finite
5   PSD                       typical shape                   finite at all ν
6   Inference                 non-standard theory             standard

1-4 are easy to demonstrate for the Random Walk (RW) with drift:

  y_t = φ_0 + y_{t−1} + ε_t,  (14)

in fact we have worked with this in a seminar exercise.
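Rows 1 and 2 of the table can be illustrated by simulation (a sketch; T, the lag and the AR coefficient 0.5 are arbitrary choices): the sample autocorrelation of a random walk stays near 1 even at a long lag, while that of a stationary AR(1) is near 0.

```python
import numpy as np

rng = np.random.default_rng(4)
T, lag = 20_000, 100

eps = rng.normal(size=T)
rw = np.cumsum(eps)              # random walk (phi_1 = 1)
ar = np.zeros(T)                 # stationary AR(1) with phi_1 = 0.5
for t in range(1, T):
    ar[t] = 0.5 * ar[t - 1] + eps[t]

def corr_at_lag(x, p):
    """Sample correlation between x_t and x_{t-p}."""
    return np.corrcoef(x[p:], x[:-p])[0, 1]

print(corr_at_lag(rw, lag), corr_at_lag(ar, lag))  # near 1 vs near 0
```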
Contrasting I(0) and I(1) II

#5 follows from our excursion into spectral analysis.

We now begin considering the inference aspects of models with I(1) variables. We start by demonstrating what turns out to be the fundamental problem for standard inference theory, and then make use of the (non-standard) statistical theory that makes valid inference possible in the I(1) case.
Spurious regression I

Granger and Newbold (1974) observed that

1. Economic time series were typically I(1);
2. Econometricians used conventional inference theory to test hypotheses about relationships between I(1) series.

G&N used Monte Carlo analysis to show that 1. and 2. imply that too many significant relationships are found in economics. Seemingly significant relationships between independent I(1) variables were dubbed spurious regressions.
Spurious regression II

To replicate the G&N results, we let YA_t and YB_t be generated by

  YA_t = φ_{A1} YA_{t−1} + ε_{A,t}
  YB_t = φ_{B1} YB_{t−1} + ε_{B,t}

where

  (ε_{A,t}, ε_{B,t})′ ~ N( (0, 0)′ , diag(σ_A², σ_B²) ).

The DGP is a 1st order VAR. YA_t and YB_t are independent random walks if φ_{A1} = φ_{B1} = 1, and stationary if |φ_{A1}| and |φ_{B1}| < 1. The regression is

  YA_t = α + βYB_t + e_t

and the hypothesis is H_0: β = 0. We consider the stationary DGP first, then the non-stationary DGP.
Summary of Monte-Carlo of static regression

With stationary variables: wrong inference (too high rejection frequencies) because of positive residual autocorrelation, but β̂ is consistent.

With I(1) variables: rejection frequencies even higher and growing with T. Indication that β̂ is inconsistent under the null of β = 0. ... what is the distribution of β̂?
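A sketch of the G&N experiment for the I(1) case (T, the replication count and the seed are arbitrary): regress one random walk on an independent one and count how often the conventional t-test rejects H_0: β = 0 at the nominal 5% level.

```python
import numpy as np

rng = np.random.default_rng(5)
T, reps = 100, 1000
reject = 0

for _ in range(reps):
    YA = np.cumsum(rng.normal(size=T))    # independent random walks
    YB = np.cumsum(rng.normal(size=T))
    X = np.column_stack([np.ones(T), YB])
    b = np.linalg.lstsq(X, YA, rcond=None)[0]
    u = YA - X @ b
    s2 = (u @ u) / (T - 2)                # residual variance estimate
    var_b = s2 * np.linalg.inv(X.T @ X)[1, 1]
    t_beta = b[1] / np.sqrt(var_b)
    reject += abs(t_beta) > 1.96          # conventional 5% two-sided test

print(reject / reps)   # far above the nominal 0.05
```

The rejection frequency comes out far above 0.05, and it grows with T.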
Dynamic regression model I

In retrospect we can ask: was the G&N analysis a bit of a strawman? After all, the regression model is obviously mis-specified, and the true DGP is not nested in the model. To check: use the same DGP, but replace the static regression by

  YA_t = φ_0 + ρYA_{t−1} + β_0 YB_t + β_1 YB_{t−1} + ε_{A,t}  (15)

Under the null hypothesis: ρ = 1 and β_0 = β_1 = 0, and there is no residual autocorrelation, neither under H_0 nor under H_1.
Dynamic regression model II

The ADL regression model (15) performs better than the static regression; for example, t_β̂₀ seems to behave as in the stationary case. This does hold true in general, since β_0 is (after reparameterization) a coefficient on a stationary variable, see Sims, Stock and Watson (1990), H p 464, 518.

But inference based on t_ρ̂ and t_β̂₁ leads to over-rejection (the size of the test is wrong) also in the dynamic model. Conclude that G&N's spurious regression problem is not only a product of mis-specification: non-stationarity also harms the correctly specified regression. We need a new and more general inference theory before we can finally solve the spurious regression problem.
The Dickey-Fuller distribution I

We now let the Data Generating Process (DGP) for y_t ~ I(1) be the simple Gaussian Random Walk:

  y_t = y_{t−1} + ε_t,  ε_t ~ N(0, σ²)  (16)

We estimate the model

  y_t = ρy_{t−1} + u_t,  (17)

where our choice of OLS estimation is based on an assumption about white-noise disturbances u_t. At the start, we know that, since the model can be written as

  Δy_t = (ρ − 1)y_{t−1} + u_t,
The Dickey-Fuller distribution II

the OLS estimate of (ρ − 1) is consistent: the stationary (finite variance) series Δy_t cannot depend on the infinite variance variable y_{t−1}, see Hamilton p 488.

However, consistency alone doesn't guarantee that √T(ρ̂ − 1) has a normal limiting distribution in this case (ρ = 1). In fact, √T(ρ̂ − 1) has a degenerate asymptotic distribution, since it can be shown that the speed of convergence is T when ρ = 1 in the DGP, another instance of super-consistency.
The Dickey-Fuller distribution III

We therefore seek the asymptotic distribution of the OLS-based stochastic variable

  T(ρ̂ − 1) = [ (1/T) Σ_{t=1}^T y_{t−1}ε_t ] / [ (1/T²) Σ_{t=1}^T y_{t−1}² ]  (18)

when the DGP is (16). Numerator in (18):

  (1/T) Σ_{t=1}^T y_{t−1}ε_t  →_L  (1/2)σ²(X − 1),  (19)

where the stochastic variable X is distributed χ²(1), see Hamilton's Ch 17.1.
The Dickey-Fuller distribution IV

The denominator in (18), see H p 487:

  (1/T²) Σ_{t=1}^T y_{t−1}²  →_L  σ² ∫₀¹ [W(r)]² dr,  (20)

where W(r), r ∈ [0, 1], is a standard Brownian motion, see Ch 17.2 for an explanation. W(r) is a process that defines stochastic variables for any r. For example: W(1) ~ N(0, 1), but W(r) ~ N(0, r), which is different from the standard normal when r < 1.
The Dickey-Fuller distribution V

The result in (20) is an application of two theorems: the functional central limit theorem, which is a generalization of the central limit theorem to the random-walk case, and the continuous mapping theorem, which generalizes Cramér's theorem, see Ch 17.3.

The asymptotic distribution of T(ρ̂ − 1) is

  T(ρ̂ − 1)  →_L  [ (1/2)(X − 1) ] / [ ∫₀¹ [W(r)]² dr ]  (21)
The Dickey-Fuller distribution VI

The χ²(1) distribution is heavily skewed towards zero: only 32% of the distribution lies to the right of 1. This means that values of X that make the numerator in (21) negative have probability 0.68. The denominator is always positive. As a result, we see that negative (ρ̂ − 1) values will be over-represented when the true value of ρ is 1.
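The over-representation of negative values can be checked by simulating T(ρ̂ − 1) under the DGP (16) (a sketch; T and the replication count are arbitrary). Asymptotically, the fraction of negative draws equals P(χ²(1) < 1) ≈ 0.68.

```python
import numpy as np

rng = np.random.default_rng(6)
T, reps = 200, 4000

stats = np.empty(reps)
for i in range(reps):
    y = np.cumsum(rng.normal(size=T))         # DGP (16) with sigma = 1
    ylag, ycur = y[:-1], y[1:]
    rho_hat = (ylag @ ycur) / (ylag @ ylag)   # OLS on (17), no intercept
    stats[i] = T * (rho_hat - 1.0)

print(np.mean(stats < 0))   # roughly 0.68
```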
The Dickey-Fuller distribution VII

The distribution in (21) is called a Dickey-Fuller (D-F) distribution. Under the H_0 of ρ = 1, the t-statistic from OLS on (17) also has a Dickey-Fuller distribution, which is of course relevant for practical testing of this H_0:

  t_DF = (ρ̂ − 1) / se(ρ̂)  (22)

where

  se(ρ̂) = sqrt( σ̂² / Σ_{t=1}^T y_{t−1}² ),  σ̂² = (1/T) Σ_{t=1}^T (y_t − ρ̂y_{t−1})²
The Dickey-Fuller distribution VIII

σ̂² is consistent since ρ̂ is consistent. Written out, t_DF is:

  t_DF = T(ρ̂ − 1) · sqrt( (1/T²) Σ_{t=1}^T y_{t−1}² ) / σ̂

Substitute for T(ρ̂ − 1) in the numerator:

  t_DF = [ (1/T) Σ_{t=1}^T y_{t−1}ε_t ] / [ sqrt( (1/T²) Σ_{t=1}^T y_{t−1}² ) · σ̂ ]
The Dickey-Fuller distribution IX

The numerator converges to (19), while we can use (20) in the denominator, so that

  sqrt( (1/T²) Σ_{t=1}^T y_{t−1}² ) · σ̂  →_L  sqrt( σ² ∫₀¹ [W(r)]² dr ) · σ = σ² sqrt( ∫₀¹ [W(r)]² dr )

and

  t_DF  →_L  [ (1/2)(X − 1) ] / sqrt( ∫₀¹ [W(r)]² dr ),  (23)

which is not a normal distribution. Intuitively, because of the skewness of X, the left-tail 5% fractile of this Dickey-Fuller distribution will be more negative than that of the normal.
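The shifted left tail of t_DF can also be simulated (a sketch; T and the number of replications are arbitrary). The 5% fractile should come out near the tabulated value of about −1.95 for this case, clearly more negative than the normal −1.645.

```python
import numpy as np

rng = np.random.default_rng(7)
T, reps = 200, 4000

tvals = np.empty(reps)
for i in range(reps):
    y = np.cumsum(rng.normal(size=T))         # random walk, DGP (16)
    ylag, ycur = y[:-1], y[1:]
    rho_hat = (ylag @ ycur) / (ylag @ ylag)
    u = ycur - rho_hat * ylag
    s2 = (u @ u) / len(u)                     # sigma-hat^2 as in (22)
    se = np.sqrt(s2 / (ylag @ ylag))
    tvals[i] = (rho_hat - 1.0) / se           # t_DF

q05 = np.quantile(tvals, 0.05)
print(q05)   # near -1.95, more negative than the normal -1.645
```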
Dickey-Fuller tables and models I

The distributions (20) and (23) have been tabulated by Monte Carlo simulation, see Hamilton's Tables B.5 and B.6. The tables are based on the following DGP ("true process" in Hamilton):

  y_t = y_{t−1} + ε_t,  ε_t ~ N(0, σ²),  y_0 = 0.  (24)

Table B.6 gives the distribution of the t-value of (ρ̂ − 1) from the following models:

  Δy_t = (ρ − 1)y_{t−1} + u_t  (a)
  Δy_t = µ + (ρ − 1)y_{t−1} + u_t  (b)
  Δy_t = µ + (ρ − 1)y_{t−1} + γt + u_t  (c)
Dickey-Fuller tables and models II

Model (a) is the case of model = DGP under H_0: ρ = 1, the case above, and is Case 1 in Table B.6. Cases 2 and 4 in Table B.6 show that the two t_DF statistics have different distributions when model (b) or (c) is estimated (the DGP is still (24)).

Case 2: the model contains an intercept (which is not present in the DGP); the fractiles are shifted to the left.

Case 4: the model contains µ and a trend, γt (drift may or may not be in the DGP).
Similar tests I

The DF test seems to depend on DGP = model (24). In practice we don't know whether the DGP, in addition to potentially ρ = 1, also contains so-called nuisance parameters, in the form of constant terms and trend. Tests that give correct inference about the parameters of interest (e.g., 5% probability of rejection of ρ = 1 when H_0 is true) even when the DGP contains nuisance parameters are called similar tests.
Similar tests II

It turns out that the D-F tables can be used to construct similar tests:

  DGP                                            Models that give similar tests
  i)   y_t = y_{t−1} + ε_t,  y_0 = 0             (a), (b), (c)
  ii)  y_t = y_{t−1} + ε_t,  arbitrary y_0       (b), (c)
  iii) y_t = µ + y_{t−1} + ε_t,  arbitrary y_0   (c)

given that the correct tables are used! (25)

The rule of thumb is that the model should be congruent with the DGP, or include deterministic terms of higher order than those in the DGP.
A special case with standard inference I

If the DGP is

  y_t = µ + y_{t−1} + ε_t

while the model is of type (b):

  Δy_t = µ + (ρ − 1)y_{t−1} + u_t,  (26)

the t-value of (ρ̂ − 1) becomes N(0, 1), as in Hamilton's Case 3. It is as if the regressor in (26) were a deterministic trend rather than y_{t−1}. The reason is that, under H_0, there is a DT in y_t (remember the solution!) in addition to a stochastic trend (ST), and the DT dominates the properties of y_t when T becomes large.
Augmented Dickey-Fuller tests I

Let the DGP be the AR(p)

  y_t − Σ_{i=1}^p φ_i y_{t−i} = ε_t  (27)

with ε_t ~ N(0, σ²). We have the reparameterization:

  Δy_t = Σ_{i=1}^{p−1} φ*_i Δy_{t−i} − (1 − φ(1))y_{t−1} + ε_t  (28)

y_t ~ I(1) is implied by (1 − φ(1)) ≡ (1 − ρ) = 0. But a simple D-F regression will have autocorrelated u_t in the light of this DGP: one or more lag coefficients φ*_i ≠ 0 are omitted.
Augmented Dickey-Fuller tests II

The augmented Dickey-Fuller test (ADF), see Ch 17.7, is based on the model

  Δy_t = Σ_{i=1}^{k−1} b_i Δy_{t−i} + (ρ − 1)y_{t−1} + u_t  (29)

Estimate by OLS and calculate t_DF from this ADF regression. The asymptotic distribution is the same as in the first-order case (with a simple random walk). The degree of augmentation can be determined by a specification search: start with a high k and stop when a standard t-test rejects the null of b_{k−1} = 0.
Augmented Dickey-Fuller tests III

The determination of lag length is an important step in practice, since

- too low a k destroys the level of the test (dynamic mis-specification),
- too high a k leads to loss of power (over-parameterization).

The ADF test can be regarded as one way of tackling unit-root processes with serial correlation. Another much used test procedure, known as the Phillips-Perron tests, is covered by Hamilton Ch 17.5. Davidson and MacKinnon also mention alternatives to the ADF, on page 623.
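A minimal ADF sketch in code (not a production implementation; the intercept, the 10%-level lag-selection rule and kmax = 8 are arbitrary choices here, and the function name adf_stat is made up): estimate (29) by OLS, starting from a high k and dropping the last augmentation lag while its t-value is insignificant.

```python
import numpy as np

def adf_stat(y, kmax=8, crit=1.645):
    """t_DF from ADF regression (29) with intercept; lag length chosen
    by a general-to-specific t-test search on the last augmentation lag."""
    dy = np.diff(y)
    for k in range(kmax, -1, -1):
        n = len(dy) - k
        cols = [np.ones(n), y[k:-1]]                             # intercept, y_{t-1}
        cols += [dy[k - i:len(dy) - i] for i in range(1, k + 1)] # lagged differences
        X = np.column_stack(cols)
        Y = dy[k:]
        b = np.linalg.lstsq(X, Y, rcond=None)[0]
        u = Y - X @ b
        s2 = (u @ u) / (n - X.shape[1])
        tvals = b / np.sqrt(np.diag(s2 * np.linalg.inv(X.T @ X)))
        if k == 0 or abs(tvals[-1]) > crit:   # keep k if last lag is significant
            return tvals[1], k                # t on y_{t-1}, chosen lag length

rng = np.random.default_rng(8)
y = np.cumsum(rng.normal(size=500))           # a random walk: should not reject
t_df, k = adf_stat(y)
print(t_df, k)
```

In practice one would compare t_DF with the Dickey-Fuller fractiles for the relevant case, not with normal critical values.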
Augmented Dickey-Fuller tests IV

There are several other tests for unit roots as well, including tests where the null hypothesis is stationarity and the alternative is non-stationarity. As one example of the continuing interest in these topics: the book by Patterson (2011) contains a comprehensive review.
References

Aukrust, K. (1977), Ludvik, Helge Erichsens Forlag, Oslo.

Granger, C. W. J. and P. Newbold (1974), "Spurious Regressions in Econometrics", Journal of Econometrics, 2, 111-120.

Hylleberg, S., R. F. Engle, C. W. J. Granger and B. S. Yoo (1990), "Seasonal Integration and Cointegration", Journal of Econometrics, 44, 215-238.

Patterson, K. (2011), Unit Root Tests in Time Series. Volume 1: Key Concepts and Problems, Palgrave Macmillan.

Sims, C., J. Stock and M. Watson (1990), "Inference in Linear Time Series Models with some Unit Roots", Econometrica, 58, 113-144.