Time series models: sparse estimation and robustness aspects

Size: px

Start display at page:

Download "Time series models: sparse estimation and robustness aspects"

Sarah Kelly
5 years ago
Views:

1 Time series models: sparse estimation and robustness aspects Christophe Croux (KU Leuven, Belgium) 2017 CRonos Spring Course Limassol, Cyprus, 8-10 April Based on joint work with Ines Wilms, Ruben Crevits, and Sarah Gelper.

2 Part I Autoregressive models: from one time series to many time series 2 / 164

3 Outline 1 Univariate 2 Multivariate 3 Big Data 3 / 164

4 Univariate Section 1 Univariate 4 / 164

5 Univariate One time series Sales growth of beer in a store: weekly data y Week Week Week Week Week Week Week / 164

6 Univariate Time Series Plot sales growth of Beer sales growth week 6 / 164

7 Univariate Autoregressive model Autoregressive model of order 1: y t = c + γ y t 1 + e t for every time point t = 1,..., T. Here, e t stands for the error term. Estimate c and γ by using ordinary least squares. 7 / 164

8 Univariate R-code and Output > plot(y,xlab="week",ylab="sales growth",type="l",main="sales growth of Beer") > y.actual=y[2:length(y)] > y.lagged=y[1:(length(y)-1)] > mylm<-lm(y.actual~ y.lagged) > summary(mylm) Coefficients: Estimate Std. Error t value Pr(> t ) (Intercept) y.lagged e-07 *** Multiple R-squared: 0.287,Adjusted R-squared: Questions: (i) what is the estimate of γ (ii) is it significant (iii) interpret its sign (iv) interpret R 2 8 / 164

9 Univariate Forecasting Using the observations y 1, y 2,..., y T we forecast the next observation uisng the formula ŷ T +1 = ĉ + ˆγ y T Can we also forecast at horizon 2? YES ŷ T +2 = ĉ + ˆγ ŷ T +1 9 / 164

10 Univariate T = 76, forecasts at horizon 1, 2, 3, and 4 sales growth of Beer + forecasts sales growth week 10 / 164

11 Multivariate Section 2 Multivariate 11 / 164

12 Multivariate 3 time series Predict next week sales using the observed Sales + additional information, like Price Marketing Effort Three stationary time series y 1,t, y 2,t, and y 3,t. 12 / 164

13 Multivariate Sales growth Marketing increase Price growth week 13 / 164

14 Multivariate Prediction model for first time series: y 1,t = c 1 + γ 11 y 1,t 1 + γ 12 y 2,t 1 + γ 13 y 3,t 1 + e 1t Forecast at horizon 1? YES ŷ 1,T +1 = ĉ 1 + ˆγ 11 y 1,T + ˆγ 12 y 2,T + ˆγ 13 y 3,T Forecast at horizon 2? NO ŷ 1,T +2 = ĉ 1 + ˆγ 11 ŷ 1,T +1 + ˆγ 12 y 2,T +1 + ˆγ 13 y 3,T +2 Why Not? y 2,T +1, y 3,T +2 not known. 14 / 164

15 Multivariate Vector Autoregressive Model y 1,t = c 1 + γ 11 y 1,t 1 + γ 12 y 2,t 1 + γ 13 y 3,t 1 + e 1t y 2,t = c 2 + γ 21 y 1,t 1 + γ 22 y 2,t 1 + γ 23 y 3,t 1 + e 2t y 3,t = c 3 + γ 31 y 1,t 1 + γ 32 y 2,t 1 + γ 33 y 3,t 1 + e 3t Estimation: OLS equation by equation. VAR(1) model γ ji = effect of time series j on time series i We have q q = 9 autoregressive parameters γ ji, with q = 3 the number of time series. 15 / 164

16 Multivariate Prediction using VAR model T = 76, forecasts at horizon 1, 2, 3, and 4 sales growth of Beer + forecasts sales growth week 16 / 164

17 Big Data Section 3 Big Data 17 / 164

18 Big Data Many time series Predict next week sales of Beer using the observed Sales of Beer + + Price of Beer Marketing Effort for Beer Prices of other product categories Marketing Effort for other product categories Vector Autoregressive model Market Response Model 18 / 164

19 Big Data Market Response Model Sales, promotion and prices for 17 product categories: q = 17 3 = 51 time series and T = 77 weekly observations Time 19 / 164

20 Big Data VAR model for q = 3 17 = 51 time series q q = 2601 autoregressive parameters Explosion of number of parameters Use the LASSO instead of OLS. Question: Why? 20 / 164

21 Big Data Network Network with q nodes. Each node corresponds with a time series. draw an edge from node i to node j if γ ji 0 the edge width is the size of the effect the edge color is the sign of the effect (blue if positive, red if negative) 21 / 164

22 Big Data price effects on sales 17 product categories SNA CIG CRA SDR FRJ RFJ FEC BJC COO BER CHE TNA CSO FRE CER OAT FRD 22 / 164

23 Big Data 23 / 164

24 Part II Basic time series concepts 24 / 164

25 Outline 1 Stationarity 2 Autocorrelation 3 Differencing 4 AR and MA Models 5 MA-infinity representation 25 / 164

26 Stationarity Section 1 Stationarity 26 / 164

27 Stationarity Example: souvenirs sold (in dollars) Frequency: monthly, sample size: T = 84 souvenirtimeseries 0e+00 2e+04 4e+04 6e+04 8e+04 1e Time 27 / 164

28 Stationarity Stochastic Process A Stochastic Process is a sequence of stochastic variables:..., Y 1, Y 2, Y 3,..., Y T.... We observe the process from t = 1 to t = T, yielding a sequence of numbers which we call a time series. y 1, y 2, y 3,..., y T We only treat regularly spaced, discrete time series. Note that the observations in a time series are not independent! We need to rely on the concept of stationarity. 28 / 164

29 Stationarity Stationarity We say that a stochastic process is (weakly) stationary if 1 E[Y t ] is the same for all t 2 Var[Y t ] is the same for all t 3 Cov(Y t, Y t k ) is the same for all t, for every k > / 164

30 Autocorrelation Section 2 Autocorrelation 30 / 164

31 Autocorrelation Autocorrelations Then we define the autocorrelation of order k as ρ k = Corr(Y t, Y t k ) = Cov(Y t, Y t k ). Var[Y t ] The autocorrelations give insight in the dependency structure of the process. The autocorrelations can be estimated by ˆρ k = T t=k+1 (y t ȳ)(y t k ȳ) T t=1 (y t ȳ) / 164

32 Autocorrelation The correlogram A plot of ˆρ k versus k is called a correlogram. On a correlogram, we often see 2 lines, corresponding to the critical values of the test statistic T ˆρ k for testing H 0 : ρ k = 0 for a specific value of k. 32 / 164

33 Autocorrelation Example ACF The first 8 autocorrelations are significantly different from zero. there is strong persistency in the series. Lag 33 / 164

34 Differencing Section 3 Differencing 34 / 164

35 Differencing Difference Operators The Lag operator L is defined as Note that L s Y t = Y t s. LY t = Y t 1. The difference operator is defined as Y t = (I L)Y t = Y t Y t 1. Linear trends can be eliminated by applying once. If a stationary process is then obtained, we say that Y t is integrated of order / 164

36 y Differencing Example Random Walk with drift: Y t = a + Y t 1 + u t, with u t i.i.d. white noise. Plot of Y t and Y t : Time diff(y) Time 36 / 164

37 Differencing Correlograms of Y t and Y t : random walk with drift ACF Lag in differences ACF Lag 37 / 164

38 Differencing Seasonality Seasonal effects of order s can be eliminated by applying the difference operator of order s: s Y t = (I L s )Y t = Y t Y t s s = 12, monthly data s = 4, quarterly data Note than one loses s observations when differencing. 38 / 164

39 Differencing Example in R souvenir <- scan(" #declare and plot time series souvenirtimeseries <- ts(souvenir, frequency=12, start=c(1987,1)) y<-diff(diff(souvenirtimeseries,lag=12)) plot.ts(souvenirtimeseries) plot.ts(y) acf(y,plot=t) souvenirtimeseries 0e+00 2e+04 4e+04 6e+04 8e+04 1e Time 39 / 164

40 Differencing Trend and seasonally differenced series = y: y Time 40 / 164

41 Differencing Correlogram: Series y ACF Lag 41 / 164

42 AR and MA Models Section 4 AR and MA Models 42 / 164

43 AR and MA Models White Noise A white noise process is a sequence of i.i.d. observations with zero mean and variance σ 2 and we will denote it by u t. It is the building block of more complicated processes: For example, a random walk (without drift) model Y t is defined by Y t = u t, or Y t = Y t 1 + u t. u t is sometimes called the innovation process. It is not predictable. 43 / 164

44 AR and MA Models Why do we need models? To describe parsimoneously the dynamics of the time series. For forecasting. For example: Take a random walk model: Y t+1 = Y t + u t+1 for every t. Recall that T is the last observation, then for every forecast horizon h. Ŷ T +h = Y T, 44 / 164

45 AR and MA Models MA model A stationary stochastic process Y t is a moving average of order 1, MA(1), if it satisfies Y t = a + u t θu t 1, where a, and θ are unknown parameters. 45 / 164

46 AR and MA Models The autocorrelations of an MA(1) are given by ρ 0 = 1 ρ 1 = Corr(Y t, Y t 1 ) = ρ 2 = 0 ρ 3 = 0... θ (1+θ 2 ) 46 / 164

47 AR and MA Models The correlogram can be used to help us to specify an MA(1) proces: y Time ACF Lag 47 / 164

48 AR and MA Models A stationary stochastic process Y t is a moving average of order q, MA(q), if it satisfies Y t = a + u t θ 1 u t 1 θ 2 u t 2... θ q u t q, where a, and θ 1,..., θ q are unknown parameters. The autocorrelations of an MA(q) process are equal to zero for lags larger than q. If the correlogram shows a strong decline and becomes non significant after lag q, then there is evidence that the series was generated by an MA(q) process 48 / 164

49 AR and MA Models MA(3) y Time ACF / 164

50 AR and MA Models Estimation Using Maximum Likelihood (assuming normality of the innovations) mymodel<-arima(y,order=c(0,0,3)) mymodel Call: arima(x = y, order = c(0, 0, 3)) Coefficients: ma1 ma2 ma3 intercept s.e sigma^2 estimated as 0.765: log likelihood = , aic = / 164

51 AR and MA Models Validation The obtained residuals -after estimation - should be close to a white noise. It is good practice to make a correlogram of the residuals, in order to validate an MA(q) model. acf(mymodel$res,plot=t,main="residual correlogram") residual correlogram ACF / 164

52 AR and MA Models AR(1) model A stationary stochastic process Y t is an autoregressive of order 1, AR(1), if it satisfies Y t = a + φy t 1 + u t, where a, and φ are unknown parameters. The autocorrelations of an AR(1) are given by ρ 0 = 1 ρ 1 = Corr(Y t, Y t 1 ) = φ ρ 2 = φ 2 ρ 3 = φ / 164

53 AR and MA Models A stationary stochastic process Y t is an autoregressive of order p, AR(p), if it satisfies Y t = a + φ 1 Y t 1 + φ 2 Y t φ p Y t p + u t where a, and φ 1,..., φ p are unknown parameters. The autocorrelations tend more slowly to zero, and sometimes have a sinusoidal form. 53 / 164

54 AR and MA Models Estimation Using Maximum Likelihood > mymodel<-arima(y,order=c(2,0,0)) > mymodel Call: arima(x = y, order = c(2, 0, 0)) Coefficients: ar1 ar2 intercept s.e sigma^2 estimated as 1.073: log likelihood = , aic = > acf(mymodel$res,plot=t,main="residual correlogram") 54 / 164

55 AR and MA Models Residual Correlogram ACF Lag 55 / 164

56 MA-infinity representation Section 5 MA-infinity representation 56 / 164

57 MA-infinity representation Wold Representation Theorem If Y t is a stationary process, then it can be written as an MA( ): Y t = c + u t + + k=1 θ k u t k for any t. 57 / 164

58 MA-infinity representation Example: AR(1) Y t = a + φy t 1 + u t = a + φ (a + φy t 2 + u t 1 ) + u t = a(1 + φ) + u t + φu t 1 + φ 2 Y t 2 = a(1 + φ + φ ) + u t + φu t 1 + φ 2 u t Recognize an MA( ) with θ k = φ k for every k from 1 to + and the constant c = a/(1 φ.) 58 / 164

59 MA-infinity representation Impulse Response Function Given an impulse to u t of one unit, the response on Y t+k is given by θ k (see MA( ) representation ) Example: AR(1): k θ k coefficients of. Impulse Response If u t increases with 1 unit, then Y t increases with 1 Y t+1 increases with φ Y t+2 increases with φ 2 Y t+k increases with φ k. 59 / 164

60 MA-infinity representation reponse lag k 60 / 164

61 Part III Introduction to dynamic models 61 / 164

62 Outline 1 Example 2 Granger Causality 3 Vector Autoregressive Model 62 / 164

63 Example Section 1 Example 63 / 164

64 Example Example Variable to predict: Industrial Production Predictor: Consumption Prices ip Time cons Time 64 / 164

65 Example Running a regression lm(formula = ip ~ cons) Coefficients: Estimate Std. Error t value Pr(> t ) (Intercept) ** cons e-12 *** --- Signif. codes: 0 *** ** 0.01 * Residual standard error: on 39 degrees of freedom Multiple R-squared: ,Adjusted R-squared: F-statistic: on 1 and 39 DF, p-value: 1.022e / 164

66 Example Spurious regression We succeed in predicting 73.2% of the variance of Industrial Production. This high number is there because both time series are upward trending, and driven by time. Regression non-stationary time series on each other is called spurious regression. Standard Inference requires stationary time series. 66 / 164

67 Example Going in log-differences Y = log(industrial Production) X = log(consumption Prices) Y growth Time X infl Time 67 / 164

68 Example Note that log(x t ) = log(x t ) log(x t 1 ) = log(x t /X t 1 ) X t X t 1 X t 1. We get relative differences, or percentagewise increments. Y = log(industrial Production) = Growth X = log(consumption Prices) = Inflation 68 / 164

69 Example Running a regression lm(formula = growth ~ infl) Coefficients: Estimate Std. Error t value Pr(> t ) (Intercept) infl e-07 *** --- Signif. codes: 0 *** ** 0.01 * Residual standard error: on 38 degrees of freedom Multiple R-squared: ,Adjusted R-squared: F-statistic: on 1 and 38 DF, p-value: 5e / 164

70 Example The regression in log-differences is a regression on stationary variables, but: residual correlogram ACF OLS remains consistent to estimate β 0 and β 1, but use Newey-West Standard Errors. Lag 70 / 164

71 Granger Causality Section 2 Granger Causality 71 / 164

72 Granger Causality Granger Causality Econometrics/Statistics can never proof that a causal relationship between X and Y exists. Consider the equation Y t = α 0 + α 1 Y t α k Y t k + β 1 X t β k X t k + ε t. We say that X Granger causes Y if it provides incremental predictive power for predicting Y. Remark: select the lag k to have a valid model. 72 / 164

73 Granger Causality Test for no Granger Causality Y t = α 0 + α 1 Y t α k Y t k + β 1 X t β k X t k + ε t. Test using an F-statistics. H 0 : β 1 =... = β k = 0 If we reject H 0, then there is significant Granger Causality. 73 / 164

74 Granger Causality Example: Does inflation Granger Causes growth? R-code lag=2;t=length(infl) x=infl[(lag+1):t] x.1=infl[(lag):(t-1)] x.2=infl[(lag-1):(t-2)] y.1=growth[(lag):(t-1)] y.2=growth[(lag-1):(t-2)] model<-lm(y~y.1+y.2+x.1+x.2) model.small<-lm(y~y.1+y.2) anova(model,model.small) 74 / 164

75 Granger Causality Analysis of Variance Table Model 1: y ~ y.1 + y.2 + x.1 + x.2 Model 2: y ~ y.1 + y.2 Res.Df RSS Df Sum of Sq F Pr(>F) e-05 *** --- We strongly reject the hypothesis of no Granger Causality. Comment: Interchanging the roles of X and Y yields an F = 3.12(P = 0.056). Hence there is also some evidence for Granger Causality in the other direction. 75 / 164

76 Vector Autoregressive Model Section 3 Vector Autoregressive Model 76 / 164

77 Vector Autoregressive Model VAR(1) for 3 series x t = c 1 + a 11 x t 1 + a 12 y t 1 + a 13 z t 1 + u x,t y t = c 2 + a 21 x t 1 + a 22 y t 1 + a 23 z t 1 + u y,t z t = c 3 + a 31 x t 1 + a 32 y t 1 + a 33 z t 1 + u z,t. A VAR is estimated by OLS, equation by equation. The components of a VAR(p) do not follow AR(p) models The lag length p is selected using information criteria 77 / 164

78 Vector Autoregressive Model The error terms are serially uncorrelated, with covariance matrix Var(u x,t ) Cov(u x,t, u y,t ) Cov(u x,t, u z,t ) Cov( u t ) = Cov(u x,t, u y,t ) Var(u y,t ) Cov(u y,t, u z,t ). Cov(u x,t, u z,t ) Cov(u y,t, u z,t ) Var(u z,t ) We assume that u t is a multivariate white noise: E[ u t ] = 0 Cov( u t, u t k ) = 0 for k > 0 Cov( u t ) := Σ No correlation at leads and lags between components of u t ; only instantaneous correlation is allowed. 78 / 164

79 Vector Autoregressive Model Impulse-response functions: If component i of the innovation u t changes with one-unit, then component j of y t+k changes with (B k ) ji (other things equal). The function k (B k ) ji is called the impulse-response function. There are k 2 impulse response functions. [There exists many variants of the impulse response functions.] 79 / 164

80 Vector Autoregressive Model VAR example: inflation-growth > library(vars) > mydata<-cbind(infl,growth) > VARselect(mydata) $selection AIC(n) HQ(n) SC(n) FPE(n) $criteria AIC(n) e e e e+01 HQ(n) e e e e+01 SC(n) e e e e+01 FPE(n) e e e e AIC(n) e e e e+01 HQ(n) e e e e+01 SC(n) e e e e+01 FPE(n) e e e e AIC(n) e e+01 HQ(n) e e+01 SC(n) e e+01 FPE(n) e e / 164

81 Vector Autoregressive Model VAR example: inflation-growth > summary(mymodel) Estimation results for equation infl: ===================================== infl = infl.l1 + growth.l1 + infl.l2 + growth.l2 + infl.l3 + growth.l3 + const Estimate Std. Error t value Pr(> t ) infl.l * growth.l infl.l growth.l infl.l growth.l const ** --- Signif. codes: 0 *** ** 0.01 * Residual standard error: on 30 degrees of freedom Multiple R-Squared: 0.85,Adjusted R-squared: F-statistic: on 6 and 30 DF, p-value: 4.38e / 164

82 Vector Autoregressive Model VAR example: inflation-growth Estimation results for equation growth: ======================================= growth = infl.l1 + growth.l1 + infl.l2 + growth.l2 + infl.l3 + growth.l3 + const Estimate Std. Error t value Pr(> t ) infl.l growth.l *** infl.l growth.l e-06 *** infl.l growth.l *** const * --- Signif. codes: 0 *** ** 0.01 * Residual standard error: on 30 degrees of freedom Multiple R-Squared: ,Adjusted R-squared: F-statistic: on 6 and 30 DF, p-value: 1.513e / 164

83 Vector Autoregressive Model VAR example: prediction > predict(mymodel,n.ahead=6) $infl fcst lower upper CI [1,] [2,] [3,] [4,] [5,] [6,] $growth fcst lower upper CI [1,] [2,] [3,] [4,] [5,] [6,] / 164

84 Vector Autoregressive Model Example: impulse-response functions Impulse Response from infl growth infl % Bootstrap CI, 100 runs 84 / 164

85 Vector Autoregressive Model Impulse Response from growth growth infl % Bootstrap CI, 100 runs 85 / 164

86 Part IV Sparse Cointegration 86 / 164

87 Outline 1 Introduction 2 Penalized Estimation 3 Forecasting Applications 87 / 164

88 Introduction Section 1 Introduction 88 / 164

89 Introduction Interest Rate Example IR 1Y IR 3Y IR 5Y IR 7Y IR 10Y Time 89 / 164

90 Introduction Bivariate cointegration Consider two time series y 1,t and y 2,t, I (1). y 1,t and y 2,t are cointegrated if there exists a linear combination such that δ t is stationary. β 11 y 1,t + β 21 y 2,t = δ t β 11 y 1,t + β 21 y 2,t = δ t : Cointegration Equation β = (β 11, β 21 ) : Cointegrating vector 90 / 164

91 Introduction Bivariate cointegration (cont.) Vector Error Correcting Representation: [ y1,t y 2,t ] [ ] [ ] γ11,1 γ = 12,1 y 1,t γ 21,1 γ 22,1 y 2,t 1 [ ] [ γ11,p 1 γ 12,p 1 y 1,t p+1 γ 21,p 1 γ 22,p 1 y 2,t p+1 [ α11 α 21 ] [β11 β 21 ] + ] [ y 1,t 1 y 2,t 1 ] + [ ɛ1,t ɛ 2,t ]. Note: y 1,t = y 1,t y 1,t 1 91 / 164

92 Introduction Vector Error Correcting Model Let y t be a q-dimensional multivariate time series, I (1). Vector Error Correcting Representation: p 1 y t = Γ i y t i + Πy t 1 + ε t, i=1 t = 1,..., T where ε t follows N q (0, Σ), denote Ω = Σ 1 Γ 1,..., Γ p 1 q q matrices of short-run effects Π q q matrix. 92 / 164

93 Introduction Vector Error Correcting Model (cont.) p 1 y t = Γ i y t i + Πy t 1 + ε t. i=1 t = 1,..., T If Π = αβ, with α and β q r matrices of full column rank r (r < q) Then, β y t stationary y t cointegrated with cointegration rank r β: cointegrating vectors α: adjustment coefficients. 93 / 164

94 Introduction Maximum likelihood estimation (Johansen, 1996) Rewrite the VECM in matrix notation: where Y = ( y p+1,..., y T ) Y = Y L Γ + YΠ + E, Y L = ( X p+1,..., X T ) with X t = ( y t 1,..., y t p+1) Y = (y p,..., y T 1 ) Γ = (Γ 1,..., Γ p 1 ) E = (ε p+1,..., ε T ). 94 / 164

95 Introduction Maximum likelihood estimation (Johansen, 1996) (cont.) Negative log likelihood L(Γ, Π, Ω) = 1 T tr (( Y Y L Γ YΠ )Ω( Y Y L Γ YΠ ) ) log Ω. Maximum likelihood estimator: Problems: ( Γ, Π, Ω) = argmin Γ,Π,Ω L(Γ, Π, Ω), subject to Π = αβ. When T pq : ML estimator has low precision When T < pq : ML estimator does not exist 95 / 164

96 Penalized Estimation Section 2 Penalized Estimation 96 / 164

97 Penalized Estimation Penalized Regression Given standard regression model with β = (β 1,..., β p ). Penalized estimate of β: β = argmin β y i = x iβ + e i, n (y i x iβ) 2 + nλp(β), i=1 with λ a penalty parameter and P(β) a penalty function. 97 / 164

98 Penalized Estimation Penalized Regression (cont.) Choices of penalty functions: P(β) = p j=1 β j : Lasso - regularization and sparsity P(β) = p j=1 β2 j : Ridge - regularization / 164

99 Penalized Estimation Penalized ML estimation Penalized negative log likelihood L P (Γ, Π, Ω) = 1 T tr (( Y Y L Γ YΠ )Ω( Y Y L Γ YΠ ) ) log Ω with P 1, P 2 and P 3 three penalty functions. + λ 1 P 1 (β) + λ 2 P 2 (Γ) + λ 3 P 3 (Ω), Penalized maximum likelihood estimator: ( Γ, Π, Ω) = argmin Γ,Π,Ω L P (Γ, Π, Ω), subject to Π = αβ. 99 / 164

100 Penalized Estimation Algorithm Iterative procedure: 1 Solve for Π conditional on Γ, Ω 2 Solve for Γ conditional on Π, Ω 3 Solve for Ω conditional on Γ, Π 100 / 164

101 Penalized Estimation 1. Solving for Π conditional on Γ, Ω Solve ( ˆα, ˆβ) Γ, Ω = argmin α,β 1 ( T tr (G Yβα )Ω(G Yβα ) ) + λ 1 P 1 (β). subject to α Ωα = I r with G = Y Y L Γ Penalized reduced rank regression (e.g. Chen and Huang, 2012) 101 / 164

102 Penalized Estimation 1.1 Solving for α conditional on Γ, Ω, β Solve with ˆα Γ, Ω, β = argmin α G = Y Y L Γ B = Yβ Weighted Procrustes problem 1 ( ) T tr (G Bα )Ω(G Bα ) subject to α Ωα = I r, 102 / 164

103 Penalized Estimation 1.2 Solving for β conditional on Γ, Ω, α Solve ˆβ Γ, Ω, α = argmin β 1 ( ) T tr (R Yβ)(R Yβ) + λ 1 P 1 (β). with R = GΩα = ( Y Y L Γ)Ωα Penalized multivariate regression 103 / 164

104 Penalized Estimation Choice of penalty function Our choice: Lasso penalty: P 1 (β) = q i=1 r j=1 β ij Other penalty functions are possible: Adaptive Lasso: P 1 (β) = q r i=1 j=1 w ij β ij Weights : w ij = 1/ ˆβ initial ij / 164

105 Penalized Estimation 2. Solving for Γ conditional on Π, Ω Solve with Γ Π, Ω = argmin Γ D = Y YΠ 1 ( T tr (D Y L Γ)Ω(D Y L Γ) ) + λ 2 P 2 (Γ). Lasso penalty: P 2 (Γ) = q i=1 q j=1 p 1 k=1 γ ijk Penalized multivariate regression 105 / 164

106 Penalized Estimation 3. Solving for Ω conditional on Γ, Π Solve Ω Γ, Π = argmin Ω 1 ( T tr (D Y L Γ)Ω(D Y L Γ) ) log Ω +λ 3 P 3 (Ω). with D = Y YΠ Lasso penalty: P 3 (Ω) = k k Ω kk Penalized inverse covariance matrix estimation (Friedman et al., 2008) 106 / 164

107 Penalized Estimation Selection of tuning parameters λ 1 and λ 2 : Time series cross-validation (Hyndman, 2014) λ 3 : Bayesian Information Criterion 107 / 164

108 Penalized Estimation Time series cross-validation Denote the response by z t For t = S,..., T 1, (with S = 0.8T ) 1 Fit model to z 1,..., z t 2 Compute ê t+1 = z t+1 ẑ t+1 Select value of tuning parameter that minimizes with ê (i) t MMAFE = 1 T 1 1 T S q t=s the i th component of ê t q i=1 ê (i) t+1 ˆσ (i), ˆσ (i) = sd(z (i) t ). 108 / 164

109 Penalized Estimation Determination of cointegration rank Iterative procedure based on Rank Selection Criterion (Bunea et al., 2011): Set r start = q For ˆr = r start, obtain Γ using the sparse cointegrating algorithm Update ˆr to with Y = Y Y L Γ P = Y(Y Y) Y µ = 2S 2 (q + l) with l = rank(y) ˆr = max{r : λ r ( Y P Y) µ}, S 2 = Y P Y 2 F Tq lq Iterate until ˆr does not change in two successive iterations 109 / 164

110 Penalized Estimation Determination of cointegration rank (cont.) Properties of Rank Selection Criterion : Consistent estimate of rank(π) Low computational cost 110 / 164

111 Penalized Estimation Simulation Study VECM(1) with dimension q: y t = αβ y t 1 + Γ 1 y t 1 + e t, (t = 1,..., T ), where e t follows N q (0, I q ). Designs: 1 Low-dimensional design: T = 500, q = 4, r = 1 Sparse cointegrating vector Non-sparse cointegrating vector 2 High-dimensional design: T = 50, q = 11, r = 1 Sparse cointegrating vector Non-sparse cointegrating vector 111 / 164

112 Penalized Estimation Sparse Low-dimensional design Average angle between estimated and true cointegration space Average angle ML PML (sparse) adjustment coefficients a 112 / 164

113 Penalized Estimation Non-sparse Low-dimensional design Average angle ML PML (sparse) adjustment coefficients a 113 / 164

114 Penalized Estimation Sparse High-dimensional design Average angle ML PML (sparse) adjustment coefficients a 114 / 164

115 Penalized Estimation Non-sparse High-dimensional design Average angle ML PML (sparse) adjustment coefficients a 115 / 164

116 Forecasting Applications Section 3 Forecasting Applications 116 / 164

117 Forecasting Applications Rolling window forecast with window size S Estimate the VECM at t = S,..., T h for forecast horizon h. y t+h = p 1 i=1 Γ i y t+1 i + Πy t, Obtain h-step-ahead multivariate forecast errors ê t+h = y t+h y t+h. 117 / 164

118 Forecasting Applications Forecast error measures Multivariate Mean Absolute Forecast Error: MMAFE = T 1 h 1 T h S + 1 q t=s q i=1 y (i) (i) t+h y t+h, σ (i) where σ (i) is the standard deviation of the i th time series in differences. 118 / 164

119 Forecasting Applications Interest Rate Growth Forecasting Interest Data: q = 5 monthly US treasury bills Maturity: 1, 3, 5, 7 and 10 year Data range: January June 2015 Methods: PML (sparse) versus ML 119 / 164

120 Forecasting Applications Multivariate Mean Absolute Forecast Error Horizon h=1 MMAFE ML PML (sparse) S=48 S=96 S=144 Rolling window size 120 / 164

121 Forecasting Applications Consumption Growth Forecasting Data: q = 31 monthly consumption time series Total consumption and industry-specific consumption Data range: January 1999-April 2015 Forecast: Rolling window of size S = 144 Methods: Cointegration: PML (sparse), ML, Factor Model No cointegration: PML (sparse), ML, Factor Model, Bayesian, Bayesian Reduced Rank 121 / 164

122 Forecasting Applications Consumption Time Series Total consumption Time Household equipm Time Household appliances Time Recreational goods Video & Audio Photographic equipm Time Time Time Info processing equipm. Medical equipm. Telephone equipm Time Time Time 122 / 164

123 Forecasting Applications Multivariate Mean Absolute Forecast Error Method h = 1 h = 3 h = 6 h = 12 Cointegration PML (sparse) ML Factor Model No Cointegration PML (sparse) ML Factor Model Bayesian Bayesian Reduced Rank / 164

124 Forecasting Applications References 1 Wilms, I. and Croux, C. (2016), Forecasting using sparse cointegration, International Journal of Forecasting, 32(4), Bunea, F.; She, Y. and Wegkamp, M. (2011), Optimal selection of reduced rank estimators of high-dimensional matrices, The Annals of Statistics, 39, Chen, L. and Huang, J. (2012), Sparse reduced-rank regression for simultaneous dimension reduction and variable selection, Journal of the American Statistical Association,107, Friedman, J., Hastie, T. and Tibshirani, R. (2008), Sparse inverse covariance estimation with the graphical lasso, Biostatistics, 9, Johansen (1996), Likelihood-based inference in cointegrated vector autoregressive models, Oxford: Oxford University Press. 6 Gelper, S., Wilms, I. and Croux, C. (2016), Identifying demand effects in a large network of product categories, Journal of Retailing, 92(1), / 164

125 Part V Robust Exponential Smoothing 125 / 164

126 Outline 1 Introduction 2 Exponential Smoothing 3 Robust approach 4 Simulations 5 ROBETS on Real Data 126 / 164

127 Introduction Section 1 Introduction 127 / 164

128 Introduction Goal Forecast many kinds of univariate time series Usually rather short time series / 164

129 Introduction 0e+00 4e+05 8e Monthly number of unemployed persons in Australia, from February 1978 till August / 164

130 Introduction A quarterly microeconometric time series Additive or Multipicative noise? No trend, damped trend or trend? No seasonality? Additive or multiplicative seasonality? 130 / 164

131 Introduction model <- ets(y) # Hyndman and Khandakar (2008) plot(forecast(model, h = 120)) Forecasts from ETS(M,Ad,M) 0e+00 4e+05 8e / 164

132 Introduction model1 <- ets(y) # Hyndman and Khandakar (2008) model2 <- robets(y) # our procedure plot(forecast(model1, h = 8)) # first plot plot(forecast(model2, h = 8)) # second plot Forecasts from ETS(M,N,N) Forecasts from ROBETS(M,A,M) / 164

133 Exponential Smoothing Section 2 Exponential Smoothing 133 / 164

134 Exponential Smoothing Simple exponential smoothing ŷ t+h t = l t l t = l t 1 + α(y t l t 1 ) y t : univariate time series ŷ t+h t : h-step ahead prediction l t : level α: smoothing parameter; in [0,1] 134 / 164

135

136 Exponential Smoothing Example 1: additive damped trend without seasonality Model: AA d N / MA d N ŷ t+h t = l t + h φ j b t j=1 l t = αy t + (1 α)(l t 1 + φb t 1 ) b t = β(l t l t 1 ) + (1 β)φb t 1 l t : the level b t : the (damped) trend α, β: smoothing parameters φ: damping parameter 136 / 164

137 Exponential Smoothing Births per 10,000 of 23 year old women, USA y No trend (φ = 0, ANN-model), Full trend (φ = 1, AAN-model) damped trend (AA d N) 137 / 164

138 Exponential Smoothing Example 2: additive damped trend and multiplicative seasonality ( MA d M ) ŷ t+h t = (l t + h φ j b t )s t+h + m m j=1 l t = α yt s t m + (1 α)(l t 1 + φb t 1 ) b t = β(l t l t 1 ) + (1 β)φb t 1 s t = y γ t l t 1 +φb t 1 + (1 γ)s t m. l t : the level b t : the (damped) trend s t : the seasonal components α, β, γ: smoothing parameters φ: damping parameter h + m = (h 1) mod m + 1 m: number of seasons per period 138 / 164

139 Exponential Smoothing Monthly number of unemployed persons in Australia, from February 1978 till August 1995 y 0e+00 4e+05 8e No trend (φ = 0, MNM-model), Full trend (φ = 1, MAM-model) damped trend (MA d M) 139 / 164

140 Exponential Smoothing How ets and robets work ANN AAN AA d N MA d M α α, β estimate parameters α, β, φ α, β, φ, γ compute AICC AICC AICC AICC AICC select model with lowest AICC MA d M AICC: selection criterion that penalizes models with a lot of parameters 140 / 164

141 Robust approach Section 3 Robust approach 141 / 164

142 Robust approach ETS is not robust Model: AA d A / MA d A ŷ t+h t = l t + h φ j b t + s t m+h + m j=1 l t = α( y t s t m ) + (1 α)(l t 1 + φb t 1 ) b t = β(l t l t 1 ) + (1 β)φb t 1 s t = γ( y t l t 1 φb t 1 ) + (1 γ)s t m. linear dependence of y t on all forecasts ŷ t+h t (for this variant, but also for all other variants) 142 / 164

143 Robust approach How to make it robust? ŷ t+h t = l t + h φ j b t + s t m+h + m j=1 l t = α( y t s t m ) + (1 α)(l t 1 + φb t 1 ) b t = β(l t l t 1 ) + (1 β)φb t 1 s t = γ( y t l t 1 φb t 1 ) + (1 γ)s t m. replace y t by yt, resulting in forecasts ŷ t+h t. [ ] yt yt ŷ t t 1 = ψ ˆσ t + ŷ t t 1 ˆσ t with ψ, the Huber function and ˆσ t an online scale estimator 143 / 164

144 Robust approach Cleaning the time series y t [ ] yt ŷ t t 1 = ψ ˆσ t + ŷ t t 1 ˆσ t Huber ψ sapply(x, psi.huber) / 164

145 Robust approach Scale estimator ( yt ŷ ) ˆσ t 2 t t 1 = 0.1 ρ ˆσ t ˆσ t 1 2 ˆσ t 1 with ρ Biweight function: sapply(x, rhobiweight) / 164

146 Robust approach Estimating parameters Parameter vector θ: α, β, γ, φ. l 0, b 0, s m+1,..., s 0 are estimated in a short startup period. 146 / 164

147 Robust approach Estimating parameters Hyndman and Khandakar (2008): maximum likelihood (MLE): additive error model ˆθ = argmax θ multiplicative error model y t = ŷ t t 1 + ɛ t l t = ŷ t t 1 + αɛ t T 2 log 1 T T ( yt ŷ t t 1 (θ) ) 2 t=1 y t = ŷ t t 1 (1 + ɛ t ) l t = ŷ t t 1 (1 + αɛ t ). ˆθ = argmax θ T 2 log 1 T T ( ) yt ŷ t t 1 (θ) 2 T log ŷt t 1 (θ) ŷ t t 1 (θ) t=1 t=1 147 / 164

148 Robust approach Robust estimation of parameters Replace mean sum of squares by τ 2 -scale for additive error model: ˆθ = argmax roblik A (θ) θ roblik(θ) = T 2 log s2 T (θ) T ( ) yt ŷ t t 1 (θ) ρ T s T (θ) t=1 s T (θ) = med y t ŷ t t 1 (θ) t 148 / 164

149 Robust approach for multiplicative error model: ˆθ = argmin θ s 2 T (θ) T T t=1 ( ) yt ŷ t t 1 (θ) ρ ŷ t t 1 (θ)s T (θ) roblik(θ) = T 2 log ( s 2 T (θ) T ( T yt ŷt t 1 ρ (θ) )) T log ŷ ŷ t t 1 (θ)s T (θ) t t 1 (θ) t=1 s T (θ) = med t y t ŷ t t 1 (θ) ŷ t t 1 (θ) t=1 149 / 164

150 Robust approach Model selection pt robust AICC = 2 rob lik + 2 T p 1 p: number of parameters to be estimated 150 / 164

151 Simulations Section 4 Simulations 151 / 164

152 Simulations Simulations Generate time series of length T + 1 = 61 from models of different types Add contamination Use the non-robust method (C) and the robust method (R) to predict value at T + 1 from first T observations. Compute over 500 simulation runs RMSE = (y 61,i ŷ 61 60,i ) i=1 152 / 164

153 Simulations y y t t Figure: Left: clean simulation, right: outlier contaminated 4 seasons, 15 years T = / 164

154 Simulations RMSE: Known model, unknown parameters generating no outliers outliers model C R C R ANN ANA AAN AA d N MNN MNA MAN MAA MA d N MNM MAM / 164

155 Simulations RMSE: Unknown model, unknown parameters generating no outliers outliers model C R C R ANN ANA AAN AAA AA d N MNN MNA MAN MAA MA d N MNM MAM slightly larger error 155 / 164

156 ROBETS on Real Data Section 5 ROBETS on Real Data 156 / 164

157 ROBETS on Real Data Real data 3003 time series from M3-competition. Yearly, quarterly and monthly. Microeconometrics, macroeconometrics, demographics, finance and industry. Length of series: from 16 to 144. Length out-of-sample period: for yearly: 6, quarterly 8 and monthly / 164

158 ROBETS on Real Data Median Absolute Percentage Error For every h = 1,..., 18: MedAPE h = 100% median y ti +h,i ŷ ti +h t i,i y ti +h,i t i : length of estimation period of time series i. 158 / 164

159 ROBETS on Real Data Method Forecasting horizon h ets robets Table: The median absolute percentage error for all data. Sometimes ets is better, sometimes robets is better 159 / 164

160 ROBETS on Real Data Forecasts from ETS(M,N,N) Forecasts from ROBETS(M,A,M) A quarterly microeconometric time series where robets has better out-of-sample forecasts 160 / 164

161 ROBETS on Real Data Forecasts from ETS(M,N,N) Forecasts from ROBETS(M,A,M) Outlier detection. 161 / 164

162 ROBETS on Real Data Computing time 4 seasons, including model selection. time series length non-robust method robets Table: Average computation time in milliseconds. 162 / 164

163 ROBETS on Real Data Exercise Data: Type in the time series: values <- c(27, 27, 7, 24, 39, 40, 24, 45, 36, 37, 31, 47, 16, 24, 6, 21, 35, 36, 21, 40, 32, 33, 27, 42, 14, 21, 5, 19, 31, 32, 19, 36, 29, 29, 24, 42, 15, 24, 21) Create a time series object: valuests <- ts(values, freq = 12). Plot the series. Are there outliers in the time series? Install and load the forecast and the robets package Estimate two models: mod1 <- ets(valuests) and mod2 <- robets(valuests)and look at the output. Make a forecast with both models: plot(forecast(mod, h = 24)) Compare both forecasts, and use plotoutliers(mod2) to detect outliers. 163 / 164

164 ROBETS on Real Data References Crevits R. and Croux C. (2016), Forecasting Time Series with Robust Exponential Smoothing with Damped Trend and Seasonal Components, [1] Gelper, S., Fried, R. & Croux, C. (2010). Robust Forecasting with Exponential and Holt-Winters Smoothing Journal of Forecasting, 29, [2] Hyndman, R. J. & Khandakar, Y. (2008). Automatic Time Series Forecasting: The Forecast Package for R. Journal of Statistical Software, 27, / 164

Forecasting using robust exponential smoothing with damped trend and seasonal components

Forecasting using robust exponential smoothing with damped trend and seasonal components Crevits R, Croux C. KBI_1714 Forecasting using Robust Exponential Smoothing with Damped Trend and Seasonal Components