Focused Information Criteria for Time Series. Gudmund Horn Hermansen, with Nils Lid Hjort. University of Oslo. May 10, 2015. 1/22
Do we really need more model selection criteria? There is already a wide range of criteria, e.g. AIC, AICc, BIC, TIC, FPE, HQ, etc., and the underlying motivations are not particularly well known among practitioners. Example: For stationary time series, there are two versions of the AIC based on similar reasoning, i.e. for model M we have AIC_n(M) = 2 log-likelihood_max(M) - 2p and AIC*_n(M) = 2 Whittle-log-likelihood_max(M) - 2p + 2q, where p = dim(M) and q has to be estimated, see Hermansen and Hjort (2015). There are (at least) three good selling points for the FIC: (1) it allows for a problem-specific focus; (2) it has a clear and simple motivation (minimising estimated mse); (3) it is in principle as easy to use as the AIC (e.g. no tuning parameters). 2/22
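As a toy illustration of the first formula, the sketch below fits Gaussian AR(p) models by conditional least squares and scores them with AIC(M) = 2 loglik_max(M) - 2p. This is a minimal sketch, not the Hermansen-Hjort implementation; the function names and the parameter count (AR coefficients + intercept + innovation variance) are my own assumptions.

```python
import numpy as np

def gaussian_aic(residuals, n_params):
    """AIC(M) = 2 * max log-likelihood - 2p for a Gaussian model,
    using the profiled (ML) innovation variance; larger is better."""
    n = len(residuals)
    sigma2 = float(np.mean(np.asarray(residuals) ** 2))
    loglik = -0.5 * n * (np.log(2 * np.pi * sigma2) + 1.0)
    return 2 * loglik - 2 * n_params

def fit_ar(y, p):
    """Conditional least-squares fit of an AR(p) model with intercept.
    Returns (residuals, parameter count); the count p + 2 includes the
    intercept and the innovation variance (a modelling choice here).
    Note: different p give slightly different effective sample sizes n - p."""
    y = np.asarray(y, dtype=float)
    n = len(y)
    cols = [np.ones(n - p)]
    for k in range(1, p + 1):
        cols.append(y[p - k : n - k])  # lag-k column aligned with y[p:]
    X = np.column_stack(cols)
    coef, *_ = np.linalg.lstsq(X, y[p:], rcond=None)
    return y[p:] - X @ coef, p + 2

def select_order(y, p_max=4):
    """Pick the AR order with the largest AIC score."""
    scores = {p: gaussian_aic(*fit_ar(y, p)) for p in range(p_max + 1)}
    return max(scores, key=scores.get), scores
```

For a strongly autocorrelated series, the AIC score for p = 1 should clearly beat the white-noise model p = 0.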
The focused information criterion. For model M, let µ_M be a focus parameter as a function of M, e.g. a quantile, a threshold probability, a covariance lag, etc. It is important that µ_M has the same interpretation across models. Let µ_M be estimated by the plug-in principle for each M, e.g. in the case where M is specified by θ_M ∈ R^p we have µ̂_M = µ(θ̂_M). The goal is to find the best model/estimator for µ with respect to mse(µ̂_M) = (bias(µ̂_M))² + Var(µ̂_M) = sqb(µ̂_M) + Var(µ̂_M). Model selection strategy: (1) Obtain a reasonable estimator for the mse and use FIC(µ, M) = m̂se(µ̂_M) = ŝqb(µ̂_M) + V̂ar(µ̂_M). (2) Choose the model and estimator with the smallest estimated mse. 3/22
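The two-step strategy can be written down in a few lines. A minimal sketch, assuming the per-model estimates ŝqb and V̂ar have already been computed by some means; the names fic_score and select_model are hypothetical:

```python
def fic_score(sqb_hat, var_hat):
    # FIC(mu, M) = estimated squared bias + estimated variance of mu-hat_M
    return sqb_hat + var_hat

def select_model(candidates):
    # candidates: model name -> (sqb_hat, var_hat); pick smallest estimated mse
    return min(candidates, key=lambda m: fic_score(*candidates[m]))
```

For instance, select_model({"narrow": (0.5, 0.1), "wide": (0.0, 0.4)}) returns "wide" (0.4 < 0.6); with a smaller squared-bias estimate the lower-variance narrow model would win instead.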
The focused information criterion. Here, we take the common large-sample approximation approach to obtain general estimators for the mse. This work extends Claeskens and Hjort (2003): a parametric approach, where all models are nested between a wide model specified by (θ, γ) and a narrow model with (θ, γ_0) and γ_0 known. The true generating model for Y is parametrised by (θ_0, γ_0 + δ/√n). The m̂se(µ̂_M) is based on an (unbiased) estimate of the mse of √n(µ̂_M − µ_true) in the limit experiment, where µ_true is µ evaluated at the truth. The extension has certain time series specific challenges, e.g. we would like to include predictions, and data-dependent foci like µ_M(Y_1, ..., Y_n) = Pr{Y_{n+1} > a and Y_{n+2} > a | Y_1, ..., Y_n} for a certain constant a. 4/22
Model and assumptions. Let Y_t = m_β(x_t) + ε_t, where ε_t is a stationary Gaussian time series with E ε_t = 0 and spectral density f_η, and the x_t are covariate vectors. Following Claeskens and Hjort (2003), the true model is specified by m_true(x_t) = m(x_t; β_0, γ_{0,1} + δ_1/√n) and f_true(ω) = f(ω; ν_0, γ_{0,2} + δ_2/√n), where θ_0 = (β_0, ν_0) ∈ R^{p_1 + p_2} and γ_0 = (γ_{0,1}, γ_{0,2}) ∈ R^{q_1 + q_2}. In addition, we need (essentially) that (1/n) [∂m_0(x_t)]^t Σ(f_0)^{-1} [∂m_0(x_t)] → M exists, with ∂m_0(x_t) = ∂m(x_t; β_0, γ_{0,1})/∂β and Σ(f_0) the associated covariance matrix, and where f_0(ω) = f(ω; ν_0, γ_{0,2}) has continuous and bounded second order derivatives. 5/22
Model and assumptions. This setup allows for misspecification in both trend and dependency. The wide model has p + q = (p_1 + p_2) + (q_1 + q_2) parameters. The candidate models are nested between the wide model and the narrow model, where γ = γ_0. A total of 2^{q_1 + q_2} possible models is obtained by including/excluding elements of γ_0; we only consider those judged sufficiently plausible. There are few papers dealing with FIC-related topics for time series models, see e.g. Claeskens et al. (2007). The derived results are also valid for the locally-stationary processes of Dahlhaus (1997). Example: Suppose Y_t = 0 + ε_t and that we are interested in a certain (important) covariance lag h; then µ_M(h) = cov_{f_ν}(Y_t, Y_{t+h}) = ∫_{-π}^{π} cos(ωh) f_ν(ω) dω. 6/22
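The covariance-lag focus can be evaluated directly from a candidate spectral density by numerical integration. A small sketch for an AR(1) spectral density (my choice of example model; for AR(1) the integral has the known closed form σ²φ^h/(1 − φ²), which makes it easy to check):

```python
import numpy as np

def ar1_spectral_density(omega, phi, sigma=1.0):
    # f(w) = sigma^2 / (2*pi*|1 - phi*exp(-iw)|^2), the AR(1) spectral density
    return sigma**2 / (2 * np.pi * np.abs(1 - phi * np.exp(-1j * omega)) ** 2)

def covariance_lag(f, h, n_grid=8192):
    # mu(h) = integral over [-pi, pi] of cos(w*h) f(w) dw, trapezoidal rule
    omega = np.linspace(-np.pi, np.pi, n_grid + 1)
    vals = np.cos(omega * h) * f(omega)
    dw = omega[1] - omega[0]
    return float(np.sum((vals[:-1] + vals[1:]) / 2) * dw)
```

The trapezoidal rule converges very fast here because the integrand is smooth and 2π-periodic.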
Okay, but does it really work? Simple simulation study with an AR(4) model, focusing on various µ_M(h) = cov_{f_ν}(Y_t, Y_{t+h}). The true model has σ = 1 and ρ = (0.4, 0.4, 0.4, 0.2), and the figure is based on 50 simulated series of length n = 50. 7/22
The focused information criterion for time series. Suppose µ = µ(θ), where θ = (β, ν), only depends on the model parameters. Then, for each submodel M, a general argument gives √n(µ̂_M − µ_true) →_d Λ_M, where Λ_M has a certain multivariate normal distribution. From this an unbiased estimator for mse(Λ_M) is constructed via m̂se(Λ_M) = ŝqb(Λ_M) + V̂ar(Λ_M). A common challenge is that ŝqb(Λ_M) is itself biased and should be corrected, resulting in a robust m̂se(Λ_M) = max{0, ŝqb(Λ_M) − b̂ias(ŝqb(Λ_M))} + V̂ar(Λ_M). The FIC strategy is to use m̂se(Λ_M) to approximate the mse for each submodel M. Here, compared to Claeskens and Hjort (2003), the general structure, arguments and formulas are quite similar. 8/22
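The truncation in the robust version guards against a negative mse estimate once the squared-bias estimate has been bias-corrected. A one-line sketch; the function and argument names are mine:

```python
def robust_mse_estimate(sqb_hat, sqb_bias_hat, var_hat):
    # max{0, sqb_hat - estimated bias of sqb_hat} + var_hat:
    # the truncation at zero keeps the estimated mse nonnegative
    return max(0.0, sqb_hat - sqb_bias_hat) + var_hat
```

When the bias correction exceeds the raw squared-bias estimate, the bias term is simply set to zero and only the variance contributes.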
The focused information criterion for time series. Under the Gaussian assumption, the parameters related to trend and dependency are independent in the limit. The traditional (non-robust) FIC can therefore be expressed as FIC(µ, M) = σ̂²_narrow + 2(σ̂²_{M_f} + σ̂²_{M_m}) + (ψ̂_wide − ψ̂_{M_f} − ψ̂_{M_m})², where the σ̂² terms are related to the variance and the ψ̂ terms to the bias. If µ is independent of either trend or dependency, e.g. µ = m_β(1) − m_β(0) or µ = C(0), the FIC-scores are indifferent to changes in the excluded direction. This suggests either detrending prior to the analysis or estimating the scores under the respective candidate model. Also, the formulas involved simplify if: m_β(x_t) = β; m_β(x_t) = x_t^t β and the x_t are from a well-behaved distribution; x_t is smooth in t; or Y_t is a locally-stationary process (cf. Dahlhaus (1997)). 9/22
What makes focus functions data-dependent? Some foci are more interesting in a conditional framework. Illustration: Consider the data-independent threshold probability µ = Pr{Y_{n+1} > a and Y_{n+2} > a} or the data-dependent µ(H_m) = Pr{Y_{n+1} > a and Y_{n+2} > a | H_m} for a suitable constant a, and with (recent) history H_m = (Y_{n−m+1}, ..., Y_n). In principle we could have H_m with m = n; in practice, quite often m is independent of n and m ≪ n. Example: For AR(q) processes it will often be sufficient to take m = q. Short-memory series with H_m and m = m_n should be effectively approximated in a fixed and recent history framework. 10/22
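For a concrete Gaussian AR(1) (my choice of illustrative model), the conditional focus above only needs the last observed value, i.e. m = 1, and can be approximated by simulating forward from the recent history:

```python
import random

def cond_two_step_exceed(y_n, a, phi, sigma=1.0, n_sim=100_000, seed=1):
    # Monte Carlo estimate of Pr{Y_{n+1} > a and Y_{n+2} > a | Y_n = y_n}
    # for the Gaussian AR(1)  Y_{t+1} = phi * Y_t + sigma * eps_{t+1}
    rng = random.Random(seed)
    hits = 0
    for _ in range(n_sim):
        y1 = phi * y_n + sigma * rng.gauss(0.0, 1.0)
        y2 = phi * y1 + sigma * rng.gauss(0.0, 1.0)
        hits += (y1 > a) and (y2 > a)
    return hits / n_sim
```

With phi = 0 the history is irrelevant and the probability collapses to the data-independent product Pr{Y_{n+1} > a} Pr{Y_{n+2} > a}; with phi > 0 a high recent value raises the conditional probability.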
What about predictions? Predictions are essentially data-dependent focus functions. To see why: let F_k = (Y_{n+1}, ..., Y_{n+k}) represent the near future; suppose we intend to predict g(F_k), e.g. for one-step-ahead predictions g(F_k) = Y_{n+1}; and suppose µ̂_M(H_m) is a predictor for g(F_k). Then mse(µ̂_M(H_m)) = E{µ̂_M(H_m) − E[g(F_k) | H_m] + E[g(F_k) | H_m] − g(F_k)}² = E{µ̂_M(H_m) − E[g(F_k) | H_m]}² + 0 + E{E[g(F_k) | H_m] − g(F_k)}², with the conclusion that a good predictor for g(F_k) is equivalent (in terms of mse) to a good estimator for µ_true(H_m) = E[g(F_k) | H_m]. 11/22
Why this interest in recent history? A data-dependent focus begs the question: mse(µ̂_M(H_m)) or mse(µ̂_M(H_m) | H_m)? Use the one that best represents what is important. Conditioning does not necessarily make sense if m = n, since then mse(µ̂_M(H_m) | H_m) = mse(µ̂_M(H_n) | H_n) = 0 for all unbiased submodel estimators. If m is independent of n and m ≪ n, it makes sense to introduce cFIC(µ, M, H_m) = m̂se(µ̂_M(H_m) | H_m). If the large-sample arguments hold conditionally and things are independent of H_m in the limit, then the familiar FIC formulas remain largely unchanged (everything involving µ now depends on H_m) and should be interpreted in relation to the conditional mse. 12/22
Why this interest in recent history? A key step of the FIC argument depends on the delta method, i.e. √n(µ̂_M(H_m) − µ_true(H_m)) = √n(µ(θ̂_M, γ̂_M, γ_{0,M^c}, H_m) − µ_true(H_m)) ≈ ∂µ(θ_0, γ_0 + δ/√n, H_m)^t Z_n, where Z_n = (√n(θ̂ − θ_0), √n(γ̂_M − γ_{0,M})) depends on H_m through (θ̂, γ̂_M). In the conditional framework, ∂µ(θ_0, γ_0 + δ/√n, H_m) is no longer random, which simplifies the arguments needed. And if Z_n | H_m →_d Z, with Z independent of H_m, we have justified our cFIC(µ, M, H_m) = m̂se(µ̂_M(H_m) | H_m). Is the conditional convergence of Z_n generally true? Again, this will not work if m = n and H_m = H_n = (Y_1, ..., Y_n). 13/22
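Since the linearisation above is the usual first-order delta method, its accuracy is easy to check numerically. A small sanity check with g(x) = exp(x) as an arbitrary smooth focus (my example, not from the slides):

```python
import math
import random

def delta_method_sd(g_prime_at_theta, sd_theta_hat):
    # First-order delta method: sd{g(theta_hat)} is approximately
    # |g'(theta)| * sd(theta_hat) when theta_hat concentrates near theta
    return abs(g_prime_at_theta) * sd_theta_hat

def monte_carlo_sd(theta=0.5, sd=0.05, n_draws=200_000, seed=0):
    # Simulate theta_hat ~ N(theta, sd^2), return the sd of g(theta_hat) = exp(theta_hat)
    rng = random.Random(seed)
    draws = [math.exp(rng.gauss(theta, sd)) for _ in range(n_draws)]
    mean = sum(draws) / n_draws
    return math.sqrt(sum((d - mean) ** 2 for d in draws) / n_draws)
```

For sd(θ̂) = 0.05 the delta-method approximation exp(θ) · 0.05 agrees with the simulated standard deviation to well within a couple of percent.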
Now, how does this play out unconditionally? It is much harder to find a simple limit experiment such that √n(µ̂_M(H_m) − µ_true(H_m)) − Λ_M(H_m) →_pr 0 unconditionally. However, following the general idea, mse(Λ_M(H_m)) ≈ mse(√n(µ̂_M(H_m) − µ_true(H_m))) = E{mse(√n(µ̂_M(H_m) − µ_true(H_m)) | H_m)} ≈ E{mse(Λ_M(H_m) | H_m)}. A quick (and dirty) solution resulting in explicit (but quite messy) formulas is to use FIC(µ, M, H_m) = Ê{m̂se(µ̂(H_m) | H_m)}. 14/22
Illustration: The Hjort liver quality index (1859–2012). For an individual fish, HSI_fish = 100 · (weight of liver)/(weight of fish). The HSI is a measure of the quality of life and is e.g. related to reproduction. The aim is to understand the dynamics of the HSI index in relation to e.g. external factors. 15/22
Illustration: The Hjort liver quality index (1859–2012). Predicting the future liver quality index. The model we consider is HSI_{year_i} = β_0 + β_1 year_i + x_i^t β + ε_i, where the intercept β_0 and σ are protected, and x_i contains winter Kola temperature, mortality rate (F) and food availability (capelin). 16/22
Illustration: The Hjort liver quality index (1859–2012). Other foci we looked at were the relative slope and the probability of two lean years in a row. More details and discussion can be found in Hermansen et al. (2015). 17/22
Illustration: Prototype FIC R-package Simple threshold probability simulation experiment. True model is Y t = 1 + 2(t/n) + ɛ t, where ɛ t is an AR(4) prosess with ρ = (0.3, 0.2, 0.1, 0.1) and σ = 1, and n = 50. Focus parameter is and µ 1 (H m ) = 0.48. µ 1 (H m ) = Pr{Y n+1 > 0 H m }, m f mu mse bias sd psi tau.sq fic fic.b aic bic p fir.r fic.b.r aic.r bic.r 1 1 1111 0.56 0.97 0.00 0.99 1.70 0.820 1.60 1.6-258.98-277.22 7 5 5 5 8 2 1 1110 0.52 0.44-0.46 0.81 2.00 0.500 1.10 1.3-257.30-272.93 6 3 3 3 4 3 1 1100 0.53 0.43-0.47 0.81 2.00 0.490 1.10 1.3-255.31-268.34 5 2 2 1 2 4 1 1000 0.64 0.74 0.45 0.73 0.86 0.380 1.40 1.4-263.35-273.77 4 4 4 7 6 5 1 0000 0.78 7.30 2.70 0.50-1.10 0.096 8.00 8.0-273.17-280.98 3 10 10 9 9 6 0 1111 0.41 2.10 1.10 0.94 2.80 0.720 2.70 2.7-259.35-274.98 6 6 6 6 7 7 0 1110 0.40 2.20 1.30 0.75 3.10 0.400 2.90 2.9-257.44-270.47 5 9 9 4 3 8 0 1100 0.40 2.20 1.30 0.75 3.10 0.400 2.90 2.9-255.54-265.96 4 8 8 2 1 9 0 1000 0.48 0.01-0.65 0.66 2.00 0.280 0.67 1.1-265.65-273.46 3 1 1 8 5 10 0 0000 0.61 2.10 1.40 0.40 0.00 0.000 2.80 2.8-279.65-284.87 2 7 7 10 10 tau.null = 0.4 psi.wide = 1.7 18/22
Illustration: Prototype FIC R-package Simple threshold probability simulation experiment. True model is Y t = 1 + 2(t/n) + ɛ t, where ɛ t is an AR(4) prosess with ρ = (0.3, 0.2, 0.1, 0.1) and σ = 1, and n = 50. Focus parameter is and µ 2 (H m ) = 0.35. µ 2 (H m ) = Pr{Y n+1 > 0 and Y n+2 > 0 H m }, m f mu mse bias sd psi tau.sq fic fic.b aic bic p fir.r fic.b.r aic.r bic.r 1 1 1111 0.39 0.98 0.000 0.99 1.400 0.74 1.50 1.50-258.98-277.22 7 4 4 5 8 2 1 1110 0.38 0.98-0.037 0.99 1.400 0.74 1.50 1.50-257.30-272.93 6 3 3 3 4 3 1 1100 0.38 0.39-0.550 0.83 1.400 0.44 0.89 1.20-255.31-268.34 5 2 2 1 2 4 1 1000 0.52 1.80 1.100 0.72 0.025 0.28 2.30 2.30-263.35-273.77 4 6 6 7 6 5 1 0000 0.61 7.20 2.600 0.62-1.400 0.14 7.70 7.70-273.17-280.98 3 10 10 9 9 6 0 1111 0.24 2.60 1.300 0.92 2.700 0.60 3.10 3.10-259.35-274.98 6 8 8 6 7 7 0 1110 0.24 2.70 1.300 0.92 2.800 0.60 3.20 3.20-257.44-270.47 5 9 9 4 3 8 0 1100 0.25 2.00 1.200 0.74 2.700 0.30 2.50 2.50-255.54-265.96 4 7 7 2 1 9 0 1000 0.33-0.22-0.780 0.62 1.400 0.14 0.28 0.88-265.65-273.46 3 1 1 8 5 10 0 0000 0.37 1.30 1.100 0.49 0.000 0.00 1.80 1.80-279.65-284.87 2 5 5 10 10 tau.null = 0.49 psi.wide = 1.4 19/22
Illustration: Prototype FIC R-package Simple threshold probability simulation experiment. True model is Y t = 1 + 2(t/n) + ɛ t, where ɛ t is an AR(4) prosess with ρ = (0.3, 0.2, 0.1, 0.1) and σ = 1, and n = 50. Focus parameter is and µ 10 (H m ) = 0.79. µ 10 (H m ) = Pr{Y n+10 > 0 H m }, m f mu mse bias sd psi tau.sq fic fic.b aic bic p fir.r fic.b.r aic.r bic.r 1 1 1111 0.78 0.52 0.0000 0.72-0.84 0.200 0.39 0.39-258.98-277.22 7 3 3 5 8 2 1 1110 0.77 0.52-0.0032 0.72-0.83 0.200 0.39 0.39-257.30-272.93 6 2 2 3 4 3 1 1100 0.77 0.52-0.0096 0.72-0.83 0.200 0.39 0.39-255.31-268.34 5 1 1 1 2 4 1 1000 0.79 0.58 0.2400 0.72-1.10 0.190 0.44 0.44-263.35-273.77 4 4 4 7 6 5 1 0000 0.78 0.96 0.6800 0.71-1.50 0.180 0.83 0.83-273.17-280.98 3 6 6 9 9 6 0 1111 0.59 2.50 1.5000 0.59 0.69 0.020 2.40 2.40-259.35-274.98 6 8 8 6 7 7 0 1110 0.59 2.50 1.5000 0.59 0.69 0.020 2.40 2.40-257.44-270.47 5 10 10 4 3 8 0 1100 0.59 2.50 1.5000 0.59 0.69 0.020 2.40 2.40-255.54-265.96 4 9 9 2 1 9 0 1000 0.60 1.80 1.2000 0.58 0.43 0.013 1.60 1.60-265.65-273.46 3 7 7 8 5 10 0 0000 0.61 0.83 0.7100 0.57 0.00 0.000 0.70 0.70-279.65-284.87 2 5 5 10 10 tau.null = 0.57 psi.wide = -0.84 20/22
Concluding remarks. Some work is still needed before completion: an R-package is planned/under development; a simulation study; model averaging and AFIC. There are no good model selection tools for the locally-stationary processes of Dahlhaus (1997). The FIC based on the Whittle approximation (cf. Whittle (1953)) has an equally rational motivation. The methodology is valid for models with known change point locations. Further topics: seasonality; a nonparametric (focused) covariance estimator. 21/22
References
Claeskens, G., Croux, C., and Van Kerckhoven, J. (2007). Prediction focussed model selection for autoregressive models. Australian & New Zealand Journal of Statistics, 49(4):359–379.
Claeskens, G. and Hjort, N. L. (2003). The focused information criterion. Journal of the American Statistical Association, 98:900–916.
Dahlhaus, R. (1997). Fitting time series models to nonstationary processes. Annals of Statistics, 25(1):1–37.
Hermansen, G. and Hjort, N. L. (2015). A new approach to Akaike's information criterion and model selection issues in stationary Gaussian time series. Technical report, University of Oslo and Norwegian Computing Centre.
Hermansen, G. H., Hjort, N. L., Kjesbu, O. S., and Marshall, C. T. (2015). Recent advances in statistical methodology applied to the Hjort liver index time series (1859–2012) and associated influential factors. Canadian Journal of Fisheries and Aquatic Sciences, 73(999):1–17.
Whittle, P. (1953). The analysis of multiple stationary time series. Journal of the Royal Statistical Society, Series B (Methodological), 15(1):125–139.
22/22