5 Time series

A time series is a set of observations recorded over time. Think, for example, of the GDP of a country over the years (or quarters), or of hourly measurements of temperature over a month. Time series present a challenge for statistical analysis because of the obvious correlation introduced by sampling adjacent points in time.

For ease of exposition, we assume that the time series is observed over a discrete and equally spaced set of times $T = \{t_1, \ldots, t_n\}$ and that $Y_t$ is the random variable that generates the observation $y_t$ at time $t$. The objective of time series analysis is to understand the process that generates the sample time series and to predict (forecast) the response variable at future times.

A complete description of the time series $\{Y_t, t \in T\}$ (intended as the random process generating the data) is given by the joint distribution function of $(Y_{t_1}, \ldots, Y_{t_n})$. In practice, some modelling assumptions on the form of this joint distribution will be needed.

An important object in time series analysis is the autocovariance function
$$\gamma_Y(s, t) = \mathrm{cov}(Y_s, Y_t) = E[(Y_s - E[Y_s])(Y_t - E[Y_t])].$$
The autocovariance measures the linear dependence between the response at two different times. It is often more convenient to consider instead the autocorrelation function (ACF)
$$\rho_Y(s, t) = \frac{\gamma_Y(s, t)}{\sqrt{\gamma_Y(s, s)\,\gamma_Y(t, t)}},$$
which is the correlation between $Y_s$ and $Y_t$. Note that this quantity is well defined only if both $Y_s$ and $Y_t$ have finite variance.

5.1 Stationary time series

Without any assumptions on the process generating the time series, it would be impossible to carry out any statistical analysis, because we observe only one replicate of the random vector $(Y_{t_1}, \ldots, Y_{t_n})$. However, in many applications there is some regularity (or smoothness) in the underlying process, which allows us to borrow information across the time series to investigate the characteristics of the process. An important class of such processes are stationary time series.

A time series is called strictly stationary if the joint distribution of every collection $(Y_{t_1}, \ldots, Y_{t_k})$ is equal to the joint distribution of the time-shifted set $(Y_{t_1+h}, \ldots, Y_{t_k+h})$, for any $h \in \mathbb{Z}$ such that $t_1 + h \geq t_1$ and $t_k + h \leq t_n$.

This definition implies that, for a strictly stationary series, $Y_s$ and $Y_t$ are identically distributed for any $s, t$. Moreover, if the means $E[Y_s]$ and $E[Y_t]$ exist, they are the same, and the autocovariance function satisfies $\gamma_Y(s, t) = \gamma_Y(s + h, t + h)$ (check this!). Therefore the covariance between two time points depends only on their time shift $h$.

Since checking strict stationarity for a time series is very difficult, and it is often too strong an assumption, we formulate a weaker definition. A time series is called weakly stationary (or simply stationary) if $Y_t$ has finite variance for all $t$ and

1. the mean $E[Y_t]$ is constant for all $t$, and
2. the autocovariance function $\gamma_Y(s, t)$ depends on $s$ and $t$ only through their difference $h = s - t$.

Strict stationarity implies (weak) stationarity, but the converse is not true. We may also simplify the notation for the autocovariance function, since it depends only on the time shift (or lag) $h$. Let $s = t + h$; then
$$\gamma_Y(t + h, t) = \mathrm{cov}(Y_{t+h}, Y_t) = \mathrm{cov}(Y_h, Y_0) = \gamma_Y(h, 0) = \gamma_Y(h),$$
with a little abuse of notation.

Example 5.1. (White noise) A simple kind of generating process for a time series is a collection of uncorrelated random variables $W_t$, all with mean 0 and finite variance $\sigma^2$. In engineering, this process is usually called white noise. White noise is a (weakly) stationary process, because $E[W_t] = 0$ and $\mathrm{Var}(W_t) = \sigma^2 < +\infty$ for all $t$, and
$$\gamma_W(s, t) = \gamma_W(h) = \begin{cases} \sigma^2 & h = 0 \\ 0 & h \neq 0. \end{cases}$$
If the $W_t$ are Gaussian random variables for all $t$ (Gaussian white noise), then the series is also strictly stationary.

Example 5.2. (Signal plus noise) In some cases, data can be modelled as an underlying deterministic signal corrupted by additive noise, e.g.
$$Y_t = \cos(2\pi t/10) + W_t,$$
with $t = 1, 2, \ldots, 100$ and $W_t$ the white noise described in Example 5.1. In this case $Y_t$ is not stationary, because $E[Y_t] = \cos(2\pi t/10)$ is not constant in time.
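To make Examples 5.1 and 5.2 concrete, here is a minimal simulation sketch in Python (assuming NumPy is available; the sample size, seed and block structure are our own illustrative choices, not part of the notes). Averaging the signal-plus-noise series at a fixed phase of the cosine shows how its mean varies with $t$.

```python
import numpy as np

rng = np.random.default_rng(0)
n, sigma = 100, 1.0
t = np.arange(1, n + 1)

# Example 5.1: Gaussian white noise with mean 0 and variance sigma^2
w = rng.normal(loc=0.0, scale=sigma, size=n)

# Example 5.2: deterministic signal cos(2*pi*t/10) corrupted by the same noise
y = np.cos(2 * np.pi * t / 10) + w

# For the white noise the overall mean and variance are roughly 0 and sigma^2 ...
print(w.mean(), w.var())
# ... while averaging y over the 10 cycles at each phase (t mod 10) recovers a
# mean that clearly depends on t, i.e. E[Y_t] = cos(2*pi*t/10) is not constant.
print(np.round(y.reshape(10, 10).mean(axis=0), 2))
```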

If the time series is stationary, we can estimate the constant mean $\mu = E[Y_t]$ with the sample mean
$$\hat{\mu} = \frac{1}{n} \sum_{t=t_1}^{t_n} Y_t,$$
which is an unbiased estimator for $\mu$, and the autocovariance function at lag $h$ with
$$\hat{\gamma}_Y(h) = \frac{1}{n} \sum_{t=t_1}^{t_n - h} (Y_{t+h} - \hat{\mu})(Y_t - \hat{\mu}),$$
with $\hat{\gamma}_Y(-h) = \hat{\gamma}_Y(h)$ and $h = 0, 1, \ldots, n-1$. Note that the larger the lag, the fewer observations are available to estimate the autocovariance. The sample autocorrelation function is then defined as
$$\hat{\rho}_Y(h) = \frac{\hat{\gamma}_Y(h)}{\hat{\gamma}_Y(0)}.$$

Remark 5.3. When the time series is a white noise, the sample ACF $\hat{\rho}_W(h)$ is asymptotically normal with zero mean and variance $\frac{1}{n}$, for $h = 1, 2, \ldots, H$. This is useful to test whether the time series at hand (or the residuals after some modelling) is indeed generated by a white noise.
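The sample ACF and the white-noise bands of Remark 5.3 are easy to compute directly. The following is a sketch assuming NumPy, with a hypothetical helper `sample_acf`; for a white noise, roughly 95% of the sample autocorrelations at lags $h \geq 1$ should fall within $\pm 1.96/\sqrt{n}$.

```python
import numpy as np

def sample_acf(y, max_lag):
    """Sample autocorrelation rho_hat(h) for h = 0, ..., max_lag,
    using the 1/n convention for the sample autocovariance."""
    y = np.asarray(y, dtype=float)
    n = len(y)
    d = y - y.mean()
    gamma = np.array([np.sum(d[h:] * d[:n - h]) / n for h in range(max_lag + 1)])
    return gamma / gamma[0]

rng = np.random.default_rng(1)
w = rng.normal(size=200)                 # white noise
rho_hat = sample_acf(w, max_lag=20)

# Remark 5.3: for white noise, rho_hat(h) ~ N(0, 1/n) for h >= 1, so most sample
# autocorrelations should fall within +/- 1.96 / sqrt(n).
band = 1.96 / np.sqrt(len(w))
print(np.abs(rho_hat[1:]) < band)
```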

5.1.1 AR models

In classical linear models, the response variable is influenced only by the corresponding independent variables (predictors), plus independent errors. In the time series case, we may need to allow the response variable at time $t$ to depend also on its past values. Autoregressive (AR) models are based on the idea that the current value of the time series $Y_t$ can be expressed as a function of $p$ past values $Y_{t-1}, \ldots, Y_{t-p}$.

An autoregressive model of order $p$, also called AR($p$), is of the form
$$Y_t = \alpha + \sum_{j=1}^{p} \beta_j Y_{t-j} + W_t,$$
where $\alpha$ and $\beta_j$, $j = 1, \ldots, p$, are unknown coefficients, $W_t$ is a white noise process with variance $\sigma^2$ and $Y_t$ is a stationary process.

Let us consider for simplicity an AR(1) process with $\alpha = 0$,
$$Y_t = \beta_1 Y_{t-1} + W_t,$$
and investigate its properties. We can rewrite it as
$$Y_t = \beta_1 Y_{t-1} + W_t = \beta_1(\beta_1 Y_{t-2} + W_{t-1}) + W_t = \beta_1^2 Y_{t-2} + \beta_1 W_{t-1} + W_t = \cdots = \beta_1^k Y_{t-k} + \sum_{j=0}^{k-1} \beta_1^j W_{t-j}.$$
We can therefore represent the AR(1) process as
$$Y_t = \sum_{j=0}^{+\infty} \beta_1^j W_{t-j},$$
with the infinite sum well defined when $|\beta_1| < 1$. If $|\beta_1| > 1$, it is possible to write an equivalent expression as a combination of future values of the white noise. For this reason, the case $|\beta_1| > 1$ is called non-causal, and it is not usually relevant for real-life time series.

When $|\beta_1| < 1$, the AR(1) process is causal, with mean
$$E[Y_t] = \sum_{j=0}^{+\infty} \beta_1^j E[W_{t-j}] = 0$$
and autocovariance function, for $h \geq 0$,
$$\gamma_Y(h) = \mathrm{cov}(Y_t, Y_{t+h}) = E\left[\left(\sum_{j=0}^{+\infty} \beta_1^j W_{t-j}\right)\left(\sum_{k=0}^{+\infty} \beta_1^k W_{t+h-k}\right)\right] = \sigma^2 \sum_{j=0}^{+\infty} \beta_1^j \beta_1^{h+j} = \sigma^2 \beta_1^h \sum_{j=0}^{+\infty} \beta_1^{2j} = \frac{\sigma^2 \beta_1^h}{1 - \beta_1^2}.$$
As a consequence, the ACF of an AR(1) process is
$$\rho_Y(h) = \gamma_Y(h)/\gamma_Y(0) = \beta_1^h, \quad h \geq 0.$$
We can then compare the sample ACF of a time series with this expression to see whether it is consistent with an AR(1) model. You can check that, if $Y_t$ is a general AR($p$) process, $E[Y_t] = \alpha/(1 - \sum_{j=1}^{p} \beta_j)$.
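As a quick check of the AR(1) ACF derived above, one can simulate a causal AR(1) and compare the sample ACF with $\beta_1^h$. A sketch assuming NumPy (the parameter values, sample size and burn-in length are arbitrary illustrative choices):

```python
import numpy as np

rng = np.random.default_rng(2)
beta1, sigma, n, burn = 0.7, 1.0, 2000, 200

# Simulate a causal AR(1): Y_t = beta1 * Y_{t-1} + W_t (the burn-in removes start-up effects)
w = rng.normal(scale=sigma, size=n + burn)
y = np.zeros(n + burn)
for t in range(1, n + burn):
    y[t] = beta1 * y[t - 1] + w[t]
y = y[burn:]

# Compare the sample ACF with the theoretical ACF rho(h) = beta1**h
d = y - y.mean()
gamma = np.array([np.sum(d[h:] * d[:len(d) - h]) / len(d) for h in range(6)])
print(np.round(gamma / gamma[0], 3))       # sample ACF at lags 0..5
print(np.round(beta1 ** np.arange(6), 3))  # theoretical ACF at lags 0..5
```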

5.1.2 MA models

Moving average (MA) models assume instead that the time series is generated by a combination of white noise terms.

A moving average model of order $q$, also called MA($q$), is of the form
$$Y_t = W_t + \sum_{h=1}^{q} \theta_h W_{t-h},$$
where $\theta_h$, $h = 1, \ldots, q$, are unknown coefficients and $W_t$ is a white noise process with variance $\sigma^2$. Since $Y_t$ is a finite linear combination of white noise terms, the process is stationary with zero mean. The autocovariance function is
$$\gamma_Y(h) = \mathrm{cov}(Y_t, Y_{t+h}) = \begin{cases} \sigma^2 \sum_{j=0}^{q-h} \theta_j \theta_{j+h} & \text{if } 0 \leq h \leq q \\ 0 & \text{if } h > q, \end{cases}$$
where $\theta_0 = 1$, and the ACF is
$$\rho_Y(h) = \begin{cases} \dfrac{\sum_{j=0}^{q-h} \theta_j \theta_{j+h}}{\sum_{j=0}^{q} \theta_j^2} & \text{if } 0 \leq h \leq q \\ 0 & \text{if } h > q. \end{cases}$$
An important feature of the MA($q$) process is precisely that the autocorrelation is zero at lags larger than $q$.

It is also relevant to note that moving average processes do not have a unique representation. For example, the processes $Y_t = W_t + 0.5 W_{t-1}$, with $W_t$ white noise of variance 4, and $Z_t = V_t + 2 V_{t-1}$, with $V_t$ white noise of variance 1, are equivalent (you can check that they are both stationary processes with the same mean and autocovariance function).
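The equivalence of these two MA(1) parametrizations can be verified numerically from the autocovariance formula above. A small sketch, assuming NumPy and a hypothetical helper `ma_autocov`:

```python
import numpy as np

def ma_autocov(theta, sigma2, h):
    """Autocovariance gamma(h) of an MA(q) process with coefficients theta_1..theta_q and
    noise variance sigma2, using gamma(h) = sigma2 * sum_{j=0}^{q-h} theta_j * theta_{j+h}."""
    coef = np.concatenate(([1.0], np.asarray(theta, dtype=float)))  # theta_0 = 1
    q = len(coef) - 1
    if h > q:
        return 0.0
    return sigma2 * np.sum(coef[: q - h + 1] * coef[h:])

# Y_t = W_t + 0.5 W_{t-1} with Var(W_t) = 4  versus  Z_t = V_t + 2 V_{t-1} with Var(V_t) = 1
for h in (0, 1, 2):
    print(h, ma_autocov([0.5], 4.0, h), ma_autocov([2.0], 1.0, h))
# Both give gamma(0) = 5, gamma(1) = 2 and gamma(h) = 0 for h > 1.
```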

5.1.3 ARMA models

Mixed autoregressive moving average (ARMA) processes model the time series as the sum of an autoregressive part and a moving average part.

A mixed autoregressive and moving average model of autoregressive order $p$ and moving average order $q$, also called ARMA($p, q$), is of the form
$$Y_t = \alpha + \sum_{j=1}^{p} \beta_j Y_{t-j} + W_t + \sum_{h=1}^{q} \theta_h W_{t-h},$$
where $\alpha$, $\beta_j$, $\theta_h$, $j = 1, \ldots, p$ and $h = 1, \ldots, q$, are unknown coefficients and $W_t$ is a white noise process with variance $\sigma^2$.

Remark 5.4. (Parameter redundancy) The same ARMA process can be parametrized in multiple ways. For example, let $Y_t = W_t$ be a white noise. Then $Y_{t-1} = W_{t-1}$ is the same (shifted) time series. If we now take a linear combination of the two, we get
$$Y_t - \beta Y_{t-1} = W_t - \beta W_{t-1},$$
or
$$Y_t = \beta Y_{t-1} + W_t - \beta W_{t-1}, \qquad (1)$$
which is an ARMA(1, 1) process. However, we know that $Y_t$ is a white noise! We have therefore masked the white noise behind an overparametrization (a range of admissible values of $\beta$ leads to the same process).

To overcome this problem (as well as the non-uniqueness of the moving average model), additional conditions may be imposed on the ARMA process. Let us first define the autoregressive (AR) and moving average (MA) polynomials as
$$\beta(z) = 1 - \beta_1 z - \cdots - \beta_p z^p$$
and
$$\theta(z) = 1 + \theta_1 z + \cdots + \theta_q z^q,$$
respectively. To avoid parameter redundancy, we can require that the two polynomials have no common factor, and include this requirement in the definition of an ARMA model. This is why the process (1) is not usually called ARMA(1,1): it can be reduced to white noise. Moreover, the form of the AR polynomial is linked to the causality of the process by the following proposition.

Proposition 5.5. An ARMA($p, q$) process is causal if and only if $\beta(z) \neq 0$ for $|z| \leq 1$, i.e. if the roots of $\beta(z)$ lie outside the unit circle.

The parameters of ARMA processes can be estimated from the time series data by maximum likelihood. The validity of the model assumptions can then be checked by looking at the residuals.
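Proposition 5.5 can be checked numerically by computing the roots of the AR polynomial. A sketch assuming NumPy (the helper `is_causal` and the example coefficients are our own illustration):

```python
import numpy as np

def is_causal(beta):
    """Check Proposition 5.5 for the AR polynomial beta(z) = 1 - beta_1 z - ... - beta_p z^p:
    the process is causal iff all roots of beta(z) lie strictly outside the unit circle."""
    beta = np.asarray(beta, dtype=float)
    # np.roots expects coefficients from the highest degree down to the constant term.
    poly = np.concatenate((-beta[::-1], [1.0]))
    roots = np.roots(poly)
    return bool(np.all(np.abs(roots) > 1.0))

print(is_causal([0.7]))        # AR(1) with beta_1 = 0.7: root at 1/0.7 ~ 1.43 -> causal
print(is_causal([1.2]))        # AR(1) with beta_1 = 1.2: root at ~0.83 -> non-causal
print(is_causal([0.5, 0.3]))   # an AR(2) example with both roots outside the unit circle
```

Maximum likelihood fitting of ARMA models, and the residuals needed for the diagnostic checks mentioned above, are available in standard statistical software; the sketch here only concerns the causality condition.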

5.2 Non-stationary time series: seasonality and trend

In many cases of interest the time series is not stationary, either because the mean changes over time or because the autocovariance function is not a function of the distance between two time points. The latter problem is more difficult to address and is outside the scope of the course. The change in the mean, however, can be included in the model using tools we have seen in the previous parts of the course.

A seasonal pattern is present in the series when the data are influenced by seasonal factors (e.g., the quarter of the year, the month, the day of the week, the hour of the day). The period $P$ of the seasonal effect is known, and we can assume that the mean changes with the same periodic dynamic,
$$Y_t = \mu_{t \bmod P} + E_t,$$
where $t \bmod P$ is the remainder of $t$ divided by the period $P$ and $E_t$ is a stationary time series. The term $\mu_{t \bmod P}$ is called the seasonal component. In practice, we can estimate it by averaging across repetitions of the same period. For example, if the data span multiple years and we think a yearly seasonal pattern is present, we can estimate the mean for January as the average of all January values over the years.

In time series analysis, any long-term, non-seasonal increase or decrease in the data is usually called a trend, i.e. a function $f(t)$ such that
$$Y_t = f(t) + E_t,$$
with $E_t$ a stationary time series. The trend $f(t)$ can be estimated using the parametric (or non-parametric) regression methods we have seen in the previous part of the course. In the case of a linear trend, an alternative is to difference the time series to remove the trend.

Therefore, we can decompose a non-stationary time series as the sum of a seasonal term, a trend and a stationary residual process:
$$Y_t = S_t + T_t + E_t.$$
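As an illustration of this decomposition, here is a sketch (assuming NumPy; the synthetic series, period and noise level are invented for illustration) that estimates a linear trend by least squares and the seasonal component by averaging the detrended series over repetitions of each period position:

```python
import numpy as np

rng = np.random.default_rng(3)
P, n_periods = 12, 10                     # e.g. monthly data over 10 years
n = P * n_periods
t = np.arange(n)

# Synthetic example: linear trend + seasonal pattern + stationary noise
y = 0.05 * t + 2.0 * np.sin(2 * np.pi * t / P) + rng.normal(scale=0.5, size=n)

# Trend T_t: fit a linear function of time by least squares
b1, b0 = np.polyfit(t, y, deg=1)
trend = b0 + b1 * t

# Seasonal component S_t: average the detrended series over repetitions of each period position
detrended = y - trend
seasonal_means = detrended.reshape(n_periods, P).mean(axis=0)
seasonal = np.tile(seasonal_means, n_periods)

# Residual E_t = Y_t - T_t - S_t should now look approximately stationary
residual = y - trend - seasonal
print(np.round(seasonal_means, 2))
print(round(float(residual.std()), 3))
```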

5.3 Forecasting

The goal is now to predict the future values of a time series, $Y_{t_n+m}$, $m = 1, \ldots, M$. We will denote by $\hat{Y}_{t_n+m|t_n}$ the prediction (or forecast) of $Y_{t_n+m}$ based on the knowledge of the time series up to time $t_n$. The accuracy of the forecast is usually measured by the mean square error, i.e. we want $E[(Y_{t_n+m} - \hat{Y}_{t_n+m|t_n})^2]$ to be small. We briefly discuss a gallery of basic methods that are often quite effective in practice (when used wisely).

5.3.1 Average method

The average method simply predicts all future values of the time series with the average of the observed series, i.e.
$$\hat{Y}_{t_n+m|t_n} = \frac{1}{n} \sum_{t=t_1}^{t_n} Y_t.$$
If the time series is stationary, we are predicting future observations with an estimate of their mean. Moreover, when $m$ becomes large, the future observations are less and less influenced by what happened in the observed window of time, and therefore the best we can expect to do is to predict them with their expected value.

5.3.2 Naive method

The naive method forecasts the future observations using the last available value, i.e.
$$\hat{Y}_{t_n+m|t_n} = Y_{t_n}.$$
The idea here is that if adjacent observations are highly correlated, the future observations will be similar to the last one, at least for small values of $m$.

5.3.3 Simple exponential smoothing

In place of the naive method, where all forecasts for the future are equal to the last observed value of the series, or the average method, where all future forecasts are equal to a simple average of the observed data, we may want something in between that takes into account the whole observed time series but gives more weight to the most recent observations. This approach is called simple exponential smoothing: forecasts are obtained by weighted averages in which the weights decrease exponentially as observations come from further in the past, i.e.
$$\hat{Y}_{t_n+1|t_n} = \alpha Y_{t_n} + \alpha(1-\alpha) Y_{t_n-1} + \alpha(1-\alpha)^2 Y_{t_n-2} + \cdots,$$
where $0 < \alpha \leq 1$ plays the role of the smoothing parameter: when $\alpha$ goes to 1 we get back the naive predictor, while for small $\alpha$ more and more weight is given to observations from the past. Alternatively, the simple exponential smoothing forecast can be written in a recursive way:
$$\hat{Y}_{t_n+1|t_n} = \alpha Y_{t_n} + (1-\alpha) \hat{Y}_{t_n|t_n-1}.$$
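The three methods above take only a few lines of code. A sketch assuming NumPy (the function names and the toy series are our own); the smoothing recursion is initialised with the first observation, as discussed just below:

```python
import numpy as np

def average_forecast(y):
    """Average method: every future value is forecast by the sample mean."""
    return float(np.mean(y))

def naive_forecast(y):
    """Naive method: every future value is forecast by the last observation."""
    return float(y[-1])

def ses_forecast(y, alpha):
    """Simple exponential smoothing, in the recursive form
    Yhat_{t+1|t} = alpha * Y_t + (1 - alpha) * Yhat_{t|t-1},
    initialised with the first observation."""
    yhat = y[0]
    for obs in y[1:]:
        yhat = alpha * obs + (1 - alpha) * yhat
    return float(yhat)

y = np.array([12.0, 13.5, 13.0, 14.2, 15.1, 14.8, 15.5])
print(average_forecast(y), naive_forecast(y), ses_forecast(y, alpha=0.3))
```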

This recursion requires an initial value for the forecast, which is usually taken as $\hat{Y}_{t_1|t_0} = Y_{t_1}$. Note that, if the sample size of the time series is large enough, we expect this initial value to have very little weight in the forecast. There is also the issue of how to choose the smoothing parameter $\alpha$. This can be done in a subjective way, perhaps by looking at the autocorrelation function, or $\alpha$ can be chosen by minimising the sum of squared errors
$$\mathrm{SSE} = \sum_{t=t_1}^{t_n} (Y_t - \hat{Y}_{t|t-1})^2.$$

5.3.4 Model-based forecasting

In general, the best forecast in terms of minimizing the MSE is the conditional expectation
$$\hat{Y}_{t_n+m|t_n} = E[Y_{t_n+m} \mid Y_{t_1}, \ldots, Y_{t_n}],$$
but to compute this we need to know the joint distribution of $Y_{t_n+m}, Y_{t_1}, \ldots, Y_{t_n}$. This is possible only for relatively simple models, and for this reason much of the theory of prediction restricts attention to linear predictors of the form
$$\hat{Y}_{t_n+m|t_n} = \sum_{t=t_1}^{t_n} \omega_t Y_t,$$
where the $\omega_t$ are suitable weights chosen to minimise the MSE (we will discuss this approach further in the context of spatial statistics).

In the case of ARMA models, the minimum MSE prediction is obtained with the following recursive procedure: (1) future values of the white noise are set to zero; (2) future values of the process are taken equal to their conditional expectation; (3) present and past values of $W_t$ and $Y_t$ are taken equal to their observed values. For example, for the ARMA(1,1) process, the minimum MSE forecast up to time $t_n + m$ is
$$\hat{Y}_{t_n+h|t_n} = \begin{cases} \beta_1 Y_{t_n} + \theta_1 W_{t_n} & \text{for } h = 1 \\ \beta_1 \hat{Y}_{t_n+h-1|t_n} & \text{for } h = 2, \ldots, m. \end{cases}$$
In practice, however, we need to plug in the maximum likelihood estimates in place of the true parameters, and the residuals in place of the past and present white noise. The resulting predictor is therefore not guaranteed to minimize the mean square error.
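The ARMA(1,1) forecast recursion above can be implemented directly once estimates of $\beta_1$, $\theta_1$ and the last residual are plugged in. A sketch with invented parameter values (the function name `arma11_forecast` is ours):

```python
import numpy as np

def arma11_forecast(y_last, w_last, beta1, theta1, m):
    """Recursive minimum-MSE forecasts for an ARMA(1,1) process:
    h = 1 uses the last observation and the last (estimated) white noise value,
    h >= 2 sets future noise to zero and replaces future observations by their forecasts."""
    forecasts = []
    yhat = beta1 * y_last + theta1 * w_last      # h = 1
    forecasts.append(yhat)
    for _ in range(2, m + 1):                    # h = 2, ..., m
        yhat = beta1 * yhat
        forecasts.append(yhat)
    return np.array(forecasts)

# Illustrative values: in practice beta1 and theta1 come from maximum likelihood
# estimation and w_last is the last model residual.
print(arma11_forecast(y_last=1.4, w_last=-0.2, beta1=0.6, theta1=0.3, m=5))
```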

5.3.5 Series decomposition

If the time series is not stationary, we may first want to decompose the series into its seasonal, trend and stationary components and forecast each component separately. Each component may also need a different forecasting method. For example, a parametric trend term can be forecast with the prediction from the linear model fitted by generalized least squares. We have already seen the expression of the generalized least squares estimator for the coefficients of a linear model when estimating the coefficients of the mixed effects models once the variance parameters are known (equation (3) of Section 4 of the lecture notes); the same idea applies here, but the covariances between the observations are given by the time dependence. For the seasonal component, an average or naive method can be used for forecasting, while an ARMA model may be fitted to the stationary residuals.
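Putting the pieces together, here is a sketch of the decompose-and-forecast strategy (assuming NumPy; for brevity the trend is fitted by ordinary rather than generalized least squares, and the stationary residual is forecast with the average method instead of an ARMA model, so this is only an illustration of the overall recipe):

```python
import numpy as np

def decompose_and_forecast(y, period, m):
    """Forecast by decomposition: fit a linear trend by least squares, estimate the
    seasonal component by period averages of the detrended series, forecast the
    stationary residual with its mean (average method), and recombine."""
    y = np.asarray(y, dtype=float)
    n = len(y)
    t = np.arange(n)

    b1, b0 = np.polyfit(t, y, deg=1)             # trend T_t = b0 + b1 * t
    detrended = y - (b0 + b1 * t)

    seasonal_means = np.array([detrended[k::period].mean() for k in range(period)])
    seasonal = seasonal_means[t % period]        # seasonal component S_t
    residual = detrended - seasonal              # (approximately) stationary E_t

    t_future = np.arange(n, n + m)
    trend_fc = b0 + b1 * t_future                # extrapolated trend
    seasonal_fc = seasonal_means[t_future % period]
    residual_fc = residual.mean()                # average-method forecast of E_t
    return trend_fc + seasonal_fc + residual_fc

rng = np.random.default_rng(4)
t = np.arange(96)
y = 0.1 * t + np.sin(2 * np.pi * t / 12) + rng.normal(scale=0.3, size=96)
print(np.round(decompose_and_forecast(y, period=12, m=6), 2))
```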