Transforms and Truncations of Time Series


Adrian N. Beaumont

Submitted in Total Fulfilment of the Requirements of the Degree of PhD Science

August 2015

School of Mathematics and Statistics
The University of Melbourne

ABSTRACT

A time series can be defined as a collection of random variables indexed according to the order in which they are obtained in time. Examples of time series are monthly Australian retail sales, or quarterly GDP data. Forecasting of time series is generally considered much more important than fitting. Models that use exponential smoothing methods have been found to perform well on time series.

Chapter 2 describes the estimation and forecasting procedure of additive forms of time series models; these include the local level model, local trend model, damped trend model, and their seasonal equivalents. This chapter also briefly discusses some other time series methods, and introduces the M3-competition data that is extensively used in this thesis.

Models that include multiplicative components for time series are considered in Chapter 3, increasing the total number of possible models from 6 to 30. While multiplicative models are often better than purely additive models, model selection methods using all combinations of multiplicative and additive models are found to be no better statistically than selecting using only the purely additive models; model selection methods are confused by the large number of possible models.

In this thesis, transforms and truncations are used with exponential smoothing, in the quest for better forecasts of time series. Two types of transforms are explored: those applied directly to a time series, and those applied indirectly, to the prediction errors. The various transforms are tested on a large number of time series from the M3-competition data, and analysis of variance (ANOVA) is applied to the results. We find that the non-transformed time series is significantly worse than some transforms on the monthly data, and on a distribution-based performance measure for both annual and quarterly data.

To try to understand why the transforms perform as they do, a simulation study was carried out, using simulations from a paper on outliers. Three types of simulations were used: a Level Shift permanently shifts the series to a new level; an Additive Outlier increases the series for only one time period; and a Transitory Change gradually reverts the series to the old level after the jump point. The non-transformed time series were significantly worse than some transforms on some simulation types.

Truncations are applied so that there is no possibility of obtaining an observation below zero on a positive-definite time series. There are two types of truncations: those applied only to the forecasts, and those applied to the fits and forecasts. By using the same methods as for the transforms, we found that the truncations worked better when applied only to the forecasts, but the non-truncated model was never significantly worse than any truncation.

Chapter 7 combines transforms with truncations. We find that applying the heteroscedastic state space transform with a truncated normal significantly improved forecasts over the non-transformed results.

The final chapter of this thesis investigates how various properties of time series affect forecasting performance. Of particular interest is the finding that a measure commonly used to assess prediction performance is flawed.

DECLARATION

This is to certify that:

1. the thesis comprises only my original work towards the degree of PhD Science except where indicated in the Preface,
2. due acknowledgement has been made in the text to all other material used,
3. the thesis is fewer than 100,000 words in length, exclusive of tables, maps, bibliographies and appendices.

PREFACE

After completing my Bachelor of Science Degree with Honours in 2004, I worked for Associate Professor Ralph Snyder of the Department of Econometrics at Monash University for seven years from 2005. During this period, I co-authored two papers, which are cited in this thesis. I also completed the computer work necessary to obtain the results in Chapter 7 of Hyndman, Koehler, Ord and Snyder (2008a), which led to some of the computational work on the transforms and truncations used in this thesis. I returned to the University of Melbourne in 2012 to commence this thesis.

When reading this document on screen, a reference contained in a red box, such as a heading or an equation number, is connected to the relevant material via a clickable link, for the convenience of the reader.

ACKNOWLEDGEMENTS

I am grateful for the valuable feedback on earlier drafts of this thesis from Associate Professor Ray Watson and Ralph Snyder. My deepest appreciation goes to my supervisor, Professor Ian Gordon. I have usually met with Ian once a fortnight for the duration of this thesis, and he has been a great help, making many suggestions that have improved the thesis and providing detailed feedback on each chapter.

CONTENTS

List of Tables
List of Figures

1. A Brief Overview of this Thesis

2. An Introduction to Innovation State Space Models
   Time Series: the General Context
   The Local Level Model
   Local Trend and Damped Trend Models
   Seasonal Models
   A Generalized State Space Model
   Estimation
   Forecasting
   Model Selection
   The M3-Competition Data
   Prediction Validation
   An Example
   Other Time Series Methods
      ARIMA
      Kalman Filter
      Multiple Sources of Error (MSOE) Models
      The Theta Method

3. An Evaluation of the Performance of Exponential Smoothing Methods
   Introduction
   Estimation and Model Selection
   Future Log Likelihood
   The Rolling Origin Method
   MASE
   Experimental Design
   Results
   Spearman's ρ Correlations for the Performance Measures
   Why MLPL and Point Prediction Measures Differ
   Conclusions

4. Data Transforms with Exponential Smoothing Methods of Forecasting
   Introduction
   Transforms on the series
      The Log Transform
      The Box-Cox Transform
   Transforms on the errors
      JETS Transform
      HSS Transform
      Using the t distribution
   Continuous Ranked Probability Score (CRPS)
   Why the RPS and MLPL can give differing results
   Experimental Design
   Results
   Tukey's Comparisons
   Encompassing Model vs AIC Method
   Conclusions

5. A Simulation Study of the Transforms
   Introduction
   LS Results
   AO Results
   TC Results
   Overall Results
   A General Overview of the HSS Results
   Some Additional Randomisation of the Simulations
   Conclusions

6. Non-Negative Distributions for Time Series
   Introduction
   The Gamma Distribution
      The Gamma Distribution Formulation
      Gamma Prediction Distributions
   The Lognormal Distribution
   The Truncated Normal Distribution
      The Basic Idea
      Application to Time Series Data
   Results
   Restriction to series where truncation is necessary
   Conclusions

7. Transforms Combined with Truncations
   The Truncated Transforms
      Log Transform
      Box-Cox Transform
      HSS Transform
      JETS Transform
      Truncated t Distribution
   Results
   Conclusions

8. An Investigation of Other Time Series Properties
   Introduction
   Why MAPE Instead of MASE was Used
   Annual Results
   Quarterly Results
   Monthly Results
   Further Research
   Conclusions

Overall Conclusions
   Limitations and Possible Further Research

Appendix
   A. Additional Tables

Bibliography

LIST OF TABLES

2.1 Transition and Measurement Equations for Six Additive Models
2.2 Fitting length statistics on the M3 data
2.3 Six Models for US employment data
Various Forms of the Damped Trend Model
Rolling Origin Illustration
ETS Models with AIC for Annual Data
ETS Models with AIC for Quarterly Data
ETS Models with AIC for Monthly Data
Spearman's ρ̂ Correlations for the Performance Measures
Annual Data with One-Step Ahead Point Prediction Measures
Original Transform Means, Best Transforms in Bold
Means of Logged Transform Data, Best Transforms in Bold
Transform Groupings for Annual Data
Transform Groupings for Quarterly Data
Transform Groupings for Monthly Data
AIC - ENC Log Means (Negatives imply that AIC is better)
Log LS Results by Grouping
% of times correct local level model selected, and mean α̂ for LS case
Log AO Results by Grouping
% of times correct local level model selected, and mean α̂ for AO case
Log TC Results by Grouping
% of times correct local level model selected, and mean α̂ for TC case
Log All Simulation Type Results by Grouping
% of times correct local level model selected, and mean α̂ for all cases
Log LS Random Results by Grouping
% of times correct local level model selected, and mean α̂ for LS random case
Log AO Random Results by Grouping
% of times correct local level model selected, and mean α̂ for AO random case
Log TC Random Results by Grouping
% of times correct local level model selected, and mean α̂ for TC random case
Log All Simulation Type Random Results by Grouping
% of times correct local level model selected, and mean α̂ for all random outliers
Averages of Prediction Measures
Averages of Logs of Prediction Measures
Grouping Results for Annual Data
Grouping Results for Quarterly Data
Grouping Results for Monthly Data
Grouping Results when restricted to 20 total series with at least one negative prediction on normal distribution
Grouping Results for Annual Data
Grouping Results for Quarterly Data
Grouping Results for Monthly Data
Outlier Distribution for Each Frequency
Data Categories for Each Frequency
General Linear Model Fit for Annual Data
Interpretation of Interaction for Annual Data
General Linear Model Fit for Quarterly Data
Interpretation of Interaction for Quarterly Data
General Linear Model Fit for Monthly Data
Interpretation of Interaction for Monthly Data: Part 1, Category vs log scsig
Interpretation of Interaction for Monthly Data: Part 2, nout vs nfit
A.1 Means of Performance Measures in Original Scale for Annual Data
A.2 Means of Performance Measures in Original Scale for Quarterly Data
A.3 Means of Performance Measures in Original Scale for Monthly Data
A.4 Mean Ranks of Performance Measures for the Annual ETS Models
A.5 Mean Ranks of Performance Measures for the Quarterly ETS Models
A.6 Mean Ranks of Performance Measures for the Monthly ETS Models

LIST OF FIGURES

2.1 US Employment Level, January 2000 to December 2011, not seasonally adjusted
2.2 Employment Fits, Predictions and 90% PIs
d = log(A,N,N) - log(A,Ad,N) for both ASE1 and MLPL
Observations and Predictions for Two Models for Annual Series
Comparison of Johnson and Normal Distributions
Models of Types of Simulations
Individual AO series
Pdfs of truncated normal random variables
Log MASE ID vs log scaled sigma for Monthly Data: Spearman's ρ̂
Log MAPE ID vs log scaled sigma for Monthly Data: Spearman's ρ̂
Log MAE vs log scaled sigma for Monthly Data: Spearman's ρ̂
Log MASE vs log MAPE for Monthly Data: Spearman's ρ̂
Log MAE vs log Mean Future Observations for Monthly Data: Spearman's ρ̂
Log MAE vs log Minimum Future Observations for Monthly Data: Spearman's ρ̂

1. A BRIEF OVERVIEW OF THIS THESIS

This thesis uses transforms and truncations of time series models to attempt to improve the forecasting performance of these models. Transforms involve using a function such as f(y) or f(e) to transform either the series observations or the errors, while truncations cut the distribution used for time series at zero to avoid problems with negative values in series that can take only positive values.

Chapter 2 gives an introduction to time series models, in which the estimation and forecasting procedure for six additive models is described. These six models are the local level model, local trend model, damped trend model, and three seasonal equivalents. When applied to past time series observations,

the local level model has the form:

    e_t = y_t - l_{t-1};
    l_t = l_{t-1} + αe_t.                                        (1.1)

Here, y_t is the t-th observation of a time series, and l_t is the level at time t, with l_{t-1} acting as the one-step ahead prediction of y_t. α is a constant that smooths the level, and can be between 0 (no smoothing, in which the level of the time series remains constant) and 1 (a random walk, in which the level changes according to the current observation). A small computational sketch of this recursion is given at the end of this chapter.

The formulation for the local trend model (Ord, Koehler and Snyder 1997), for t = 1, 2, ..., n, is:

    µ_t = l_{t-1} + b_{t-1};
    e_t = y_t - µ_t;
    l_t = l_{t-1} + b_{t-1} + αe_t;                              (1.2)
    b_t = b_{t-1} + βe_t.

Here, µ_t is the one-step ahead forecast, b_t is the gradient at time t, and β is

a smoothing parameter that affects the next period's gradient. As well as having a level that can vary, we now have a variable gradient; this is why this model is known as a local trend model. A damped trend model can be obtained by replacing all instances of b_{t-1} with φb_{t-1} in Equation (1.2); φ is a damping constant between 0 and 1.

Seasonal models allow for the repetition of a seasonal cycle. The formulation for a damped trend model with seasonality is:

    µ_t = l_{t-1} + φb_{t-1} + s_{t-m};
    e_t = y_t - µ_t;
    l_t = l_{t-1} + φb_{t-1} + αe_t;                             (1.3)
    b_t = φb_{t-1} + βe_t;
    s_t = s_{t-m} + γe_t.

Here, s_t is the seasonal effect, and γ the seasonal smoothing parameter, which has the restriction 0 ≤ γ ≤ 1 - α. Note that, unlike l_t and b_t, s_t has a lag of m periods; this is so that the previous occurrence of a season is used.

On occasion, we want to multiply components such as trend and level,

rather than add them; if models with multiplicative components are included, there are a total of 30 models (Hyndman, Koehler, Ord and Snyder 2008a, Chap 2).

Chapter 3 gives results for these 30 Error, Trend, Seasonal (ETS) models using the M3-competition database of time series (Makridakis and Hibon 2000). It also introduces a distribution-based forecasting performance measure, which we call the future log likelihood (Gneiting and Raftery 2007). We used two model selection methods in this chapter based on the Akaike Information Criterion (AIC) (Akaike 1974): AIC-all selects from all 30 ETS models, and AIC-add selects from only the six additive models. The Analysis of Variance (ANOVA) analyses in this chapter show that model selection methods that use all models are in no cases significantly better than using only the additive models.

Chapter 4 describes the transforms of time series as applied to additive models. Two transforms were applied to the series: the log transform and the Box-Cox transform. Three transforms were applied to the errors: the Johnson Error, Trend, Seasonal (JETS) transform, the heteroscedastic state space (HSS) transform, and the t transform. Another distribution-based measure called the Ranked Probability Score (RPS) is introduced, and the various transforms are compared on the M3 data using ANOVA. The HSS transform,

with one parameter estimated by a proportional approach, is never significantly worse than any other transform. This chapter is based on a paper by the author (Beaumont 2014).

Chapter 5 uses a simulation study to further assess the transforms. This study uses a paper on outliers (Koehler, Snyder, Ord and Beaumont 2012) to generate the simulated series. The simulated series can have either an Additive Outlier (AO), in which the series returns to its original level immediately after the outlier; a Level Shift (LS), in which the series moves permanently to a new level; or a Transitory Change (TC), in which the series returns gradually to its original level following a jump point. Analysis for all transforms is given for all simulation types, and for a combined sample of all the simulation types.

Chapter 6 assesses truncations of time series using the same methods as in Chapter 4. The normal distribution can occasionally give negative predictions, so three distributions were used to stop negative predictions on positive definite series: the gamma distribution, lognormal distribution and truncated normal distribution. We tried applying these truncations to both the fits and the predictions, or just the predictions. We found that truncations applied

only to the predictions are generally better, but the non-truncated distribution is not significantly worse than any truncated distribution on the M3 data.

Chapter 7 combines the transforms of Chapter 4 with the truncated distributions of Chapter 6. The truncated HSS proportional c transform is never significantly worse than any other method, and sometimes it is significantly better than the non-transformed series. As a result, we recommend this method.

Chapter 8 assesses how various properties of time series affect forecasting performance. The most interesting finding of this chapter is that a measure commonly used for prediction performance is flawed. Most other results in this chapter were as expected: prediction performance becomes worse as the estimated standard deviation of the model selected by the AIC increases.

Some types of transforms and truncations are found to be statistically better than the non-transformed time series in forecasting performance. As a result, these methods are recommended over the non-transformed time series.

In general, discussions of statistical significance use a threshold of P < 0.05,

commonly adjusted for multiple comparisons using Tukey's method. All logarithms are natural logarithms. Most computer work in this thesis was completed using the mathematical programming language Matlab version R2010b, with the statistical software Minitab versions 16 and 17 used for ANOVA and Tukey's comparisons. Minitab 16 was adequate until Chapter 8, where Minitab 17 was used.
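As a concrete illustration of the local level recursion (1.1), the short sketch below applies the level updates to a toy series. It is written in Python rather than the Matlab used for the computational work in this thesis, and the series, seed level and smoothing parameter are made-up illustrative values; in practice l_0 and α are estimated as described in Chapter 2.

    import numpy as np

    def local_level_filter(y, alpha, l0):
        """Apply (1.1): e_t = y_t - l_{t-1}; l_t = l_{t-1} + alpha * e_t."""
        level = l0
        errors = []
        for obs in y:
            e = obs - level            # one-step ahead error
            level = level + alpha * e  # smoothed level update
            errors.append(e)
        return np.array(errors), level

    # toy series and parameter values, for illustration only
    y = np.array([10.2, 10.8, 10.5, 11.1, 11.4, 11.0])
    errors, final_level = local_level_filter(y, alpha=0.3, l0=y[0])
    print(errors)        # the one-step ahead errors
    print(final_level)   # l_n, the one-step ahead forecast of the next observation

With α = 0 the level never moves from l_0, and with α = 1 the level simply tracks the latest observation, matching the two extremes described after Equation (1.1).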

2. AN INTRODUCTION TO INNOVATION STATE SPACE MODELS

2.1 Time Series: the General Context

A time series can be defined as a collection of random variables indexed according to the order in which they are obtained in time. For example, in the sequence of random variables Y_1, Y_2, Y_3, ..., the random variable Y_1 represents the value of the time series at period one, and similarly for all other periods (Shumway and Stoffer 2000, Chap 1).

While time series can use continuous time, or be irregularly spaced, the discrete, regular-interval time series is the definition we will use in this thesis, as these series are the most important in practice (Shumway and Stoffer 2000, Chap 1). The intervals can be annual, quarterly, monthly, weekly, or other,

but data should be observed every month for monthly data, every year for annual data, and so on. The usual and preferred way for time series to be graphed is by putting time periods on the x-axis, and the observations corresponding to each time period on the y-axis. Examples of time series are daily movements in stock closing prices, weekly US claims for unemployment benefits, monthly Australian retail sales data, quarterly gross domestic product for many countries, and annual global average temperatures.

Time series are usually autocorrelated; that is, observation y_t of a time series will to some extent be associated with observation y_{t-1} (Woodward, Gray and Elliott 2012, Chap 1), because of relevant causal phenomena that are operating in periods close together in time. In seasonal time series, y_t is often similarly associated with the observation in the previous occurrence of the season: in a monthly series, last February's result will be correlated with this February's outcome, because of causal phenomena particular to February. For these reasons, statistical methods that assume independence of observations are inappropriate for time series.

The autocorrelation function (ACF) of a time series is a graphical representation of the correlation between y_t and y_{t-k}, where k is an integer. For most

non-seasonal time series, we expect the ACF to be high at k = 1, and to decay towards zero as k becomes larger.

Time series and time series analysis have several applications. First, we may want to forecast events such as the weather or a country's economic output. An accurate forecast of the daily weather in a particular city gives people in that city information about what clothes they should wear, or whether to pack an umbrella. If accurate forecasts of economic output can be made, they may allow governments and business to confidently set their agenda.

A second application of time series is to assess the impact of a single event, such as the impact of the 11 September 2001 terrorist attacks on political polling, or the impact of the global financial crisis of 2008 on economic output. In this application, the observation corresponding to the single event is evaluated in relation to the overall pattern of the series. This is strongly related to ideas from statistical quality control.

A third application is to assess whether there is a relationship between two or more time series, such as between the unemployment rate and economic output. In these cases the cross-correlation may be a relevant descriptive measure; the concept of Granger causality (Granger 1969) is also relevant

to this application.

A fourth application is stylised facts. This term is used in a number of ways. It may refer in a rather vague and general way to overall observations or conclusions that have been made in so many contexts that they are widely accepted as true. More narrowly, but related, is the usage that refers to analysis of characteristics of classes of time series; for example, financial time series tend to have a high kurtosis and a slowly decaying autocorrelation function (Malmsten and Terasvirta 2004).

This thesis will focus on forecasting, as predicting the future is a very important application of time series (Wei 2006, Chap 5; Ord and Fildes 2013, Chap 1). For example, the first 30 observations of a time series of length 36 may be known. It is not of great interest to provide a good fit to these 30 observations; what is of far more importance is forecasting what will happen in the next six observations. In this thesis, we will assess time series methods based on how well they forecast, and not on whether they provide good fits to the data already observed.

When more general statistical methods are used for prediction, the predictors can occur anywhere in the space of the predictor variables. On the other

hand, for time series, the observations to be predicted come from the future. In a time series, for prediction purposes, we are much more concerned with what has been happening in the recent past than what was happening a long time ago. Thus, methods are required that focus attention on the most relevant time periods.

2.2 The Local Level Model

The essential nature of a time series is that observations are made progressively: the series evolves in time. It is sensible to suggest models which incorporate this idea, by focusing on the sequential way subsequent observations follow from those before them. These models imply methods for forecasting. One such perspective, described in this thesis, is the method of exponential smoothing (Brown 1959). Brown (1959) invented the method of simple exponential smoothing, and Muth (1960) introduced two statistical models for which the optimal forecasts are equivalent to those obtained by simple exponential smoothing. With t = 1, 2, 3, ..., n, the formulation for the local level model for future observations

y_t is:

    y_t = l_{t-1} + e_t;
    l_t = l_{t-1} + αe_t.                                        (2.2.1)

Here, l_t is the level (or series fit) at time t, with l_{t-1} acting as the one-step ahead prediction of y_t. The e_t are usually assumed to be normally and independently distributed with mean 0 and variance σ^2. α is known as the smoothing parameter, and is such that 0 ≤ α ≤ 1. This model is like estimating just the mean of the series, but here the mean, represented by l_t, is able to vary over time.

When applying this model to past values of an actual series, the residual errors are defined as ê_t = y_t - l_{t-1}. Substituting this into the second line of (2.2.1) gives a slightly different form, which we iterate for t = 1, 2, ..., n:

    ê_t = y_t - l_{t-1};                                         (2.2.2)
    l_t = (1 - α)l_{t-1} + αy_t.                                 (2.2.3)

In principle, we could use this form to obtain predictions. A problem is that l_0 must be known before (2.2.2) can be applied to the first observation of time

series y. Since l_0 cannot be determined from the model, it is estimated; this is known as the estimation of a seed state. Further, α must also be estimated. The standard approach is to seek the values l̂_0 and α̂ that are such that the sum of squared residual errors, Σ ê_t^2, is minimised. The estimated variance can then be calculated as σ̂^2 = Σ ê_t^2 / n.

When α = 0, there is no dependence of the predictions on y_t, and we have l_n = l_{n-1} = ... = l_0. In this case l_0 should be estimated by the mean of the whole series, so the case α = 0 is equivalent to a simple regression model where only the mean is estimated. At the other extreme, at α = 1, we have l_t = y_t, and the forecast for the next period will be equal to the observation for the current period. α is known as the smoothing parameter because a value of α between 0 and 1 will usually provide a better and smoother fit to the series than either of the above extremes.

In general, by recursively substituting l_t in (2.2.3), an exponentially weighted

relationship can be obtained for l_k as a function of l_0 and all y:

    l_k = (1 - α)l_{k-1} + αy_k
        = (1 - α)[(1 - α)l_{k-2} + αy_{k-1}] + αy_k
        = (1 - α)^2 l_{k-2} + α(1 - α)y_{k-1} + αy_k.

If we continue to substitute for each l_{k-i}, we obtain the following relationship:

    l_k = (1 - α)^k l_0 + α[(1 - α)^{k-1} y_1 + ... + (1 - α)y_{k-1} + y_k]
        = (1 - α)^k l_0 + α Σ_{i=1}^{k} (1 - α)^{k-i} y_i.       (2.2.4)

As α increases, the more recent observations become more heavily weighted.

2.3 Local Trend and Damped Trend Models

Many time series have a clear linear trend over time that may be towards greater or lesser values of the time series. This trend can increase, disappear, or even reverse, as happened for many economic time series during the global financial crisis. As a result, we need to be able to smooth the trend as well as the level. This method was first suggested in separate articles by both

Holt (1957, reprinted 2004) and Winters (1960), so it is known as the Holt-Winters method. The formulation for the local trend model (Ord, Koehler and Snyder 1997), for t = 1, 2, ..., n, is:

    µ_t = l_{t-1} + b_{t-1};
    e_t = y_t - µ_t;
    l_t = l_{t-1} + b_{t-1} + αe_t;                              (2.3.1)
    b_t = b_{t-1} + βe_t.

Here, µ_t is the one-step ahead forecast, b_t is the gradient at time t, and β is a smoothing parameter that affects the next period's gradient. As well as having a level that can vary, we now have a variable gradient; this is why this model is known as a local trend model. Four parameters must now be estimated so as to minimise the sum of squared errors: l_0, b_0, α and β. The restrictions on α and β are 0 ≤ α ≤ 1 and 0 ≤ β ≤ α. When β = 0, there is no change in the growth rate.

The damped trend method (Gardner and McKenzie 1985) has an extra parameter φ, 0 < φ ≤ 1, that provides damping of the trend factor. Replacing all instances of b_{t-1} in (2.3.1) with φb_{t-1} gives the damped trend model

formulation (Hyndman, Koehler, Ord and Snyder 2008a, Chap 2), which is:

    µ_t = l_{t-1} + φb_{t-1};
    e_t = y_t - µ_t;
    l_t = l_{t-1} + φb_{t-1} + αe_t;                             (2.3.2)
    b_t = φb_{t-1} + βe_t.

The damped trend model thus has one more parameter that needs to be estimated than the local trend model. When φ = 1 there is no damping, and at the other extreme, when φ = 0, there would be no growth term, so we restrict φ to be greater than 0.

2.4 Seasonal Models

Many time series have their data collected in monthly or quarterly intervals, and many of these will exhibit a strong seasonal pattern, in which it is clear that one particular month or quarter is doing better than other months or quarters. For example, for societies with a strong emphasis on Christmas, we would expect retail sales to do much better in December than in other

months of the year, due to Christmas shopping. Occasionally, the seasonal pattern itself changes during the course of a time series, and so it must also be smoothed. Below is the formulation for a damped trend model fitted with m seasons:

    µ_t = l_{t-1} + φb_{t-1} + s_{t-m};
    e_t = y_t - µ_t;
    l_t = l_{t-1} + φb_{t-1} + αe_t;                             (2.4.1)
    b_t = φb_{t-1} + βe_t;
    s_t = s_{t-m} + γe_t.

Here, s_t is the seasonal effect, and γ the seasonal smoothing parameter, which has the restriction 0 ≤ γ ≤ 1 - α. Note that, unlike l_t and b_t, s_t has a lag of m periods; this is so that the previous occurrence of a season is used. Seed state estimation can become cumbersome when seasonal models are considered, as we need initial estimates of the m seasons, with the restriction that the sum of the initial seasonal indices is zero; this restriction is so that, overall, seasonality does not contribute. When γ = 0, there is no change in the seasonality pattern of a time series.

Note that exponential smoothing can accommodate a series that initially has

strong seasonality, which then diminishes; the γ parameter in the seasonal model will then be large.

2.5 A Generalized State Space Model

The l_t, b_t and s_t from the previous sections can all be collectively referred to as states. They all exhibit some change with time, and knowing what they are means we know what the model is doing at time t. Any time series model that has only additive components (only pluses and minuses between distinct components such as level, trend and error) can be described using the following formulation for t = 1, 2, ..., n (Hyndman et al. 2008a, Chap 1):

    µ_t = w x_{t-1};                                             (2.5.1)
    e_t = y_t - µ_t;                                             (2.5.2)
    x_t = Fx_{t-1} + ge_t.                                       (2.5.3)

Here, x_t is the vector of all states. If there are k states, then x_t will be a

k × 1 vector. Since µ_t must be a scalar, w must be a 1 × k vector, and (2.5.1) is known as the measurement equation because it produces a measurement that predicts what the series should be at time t based on information up to time t - 1. Equation (2.5.3) is called the transition equation because it describes how the states evolve over time. F is a k × k matrix, and g is a k × 1 vector of smoothing parameters. The e_t are the one-step ahead errors. Because the same e_t are used in both Equations (2.5.2) and (2.5.3), this formulation is known as an innovations state space model.

Restrictions on the values of α, β, γ and φ for the previous models are as in Chapter 2 of Hyndman et al. (2008a), and are known as the traditional approach. There are slightly different formulations of the trend and seasonal models than those presented here, in which β* = β/α and γ* = γ/(1 - α). In the traditional approach, it is desirable to have 0 ≤ α, β*, γ* ≤ 1; this gives us the β ≤ α and γ ≤ 1 - α restrictions.

Snyder, Ord and Koehler (2001) and Hyndman, Akram and Archibald (2008) show that these restrictions are often stricter than necessary, and that if all eigenvalues of the matrix D = F - gw are inside the unit circle, a milder restriction, we have stable forecasts, a desirable property. For the local level

model, using this would allow a range for α of [0, 2), rather than [0, 1]. From personal experience, using the milder stability conditions leads to inferior forecasts, and can cause computer crashes, so the traditional restrictions will continue to be used in this thesis.

Below are examples of how models discussed in previous sections can be converted into the state space format. For the local level model, there is only one state, l_t, so x_t = l_t, w = F = 1 and g = α. For the damped trend model, there are two states, so x_t = [l_t, b_t]′, w = [1, φ],

    F = [ 1  φ ]
        [ 0  φ ]

and g = [α, β]′. For seasonal models, first divide the states into seasonal and non-seasonal components. The non-seasonal components can be done as above, and then joined to the seasonal components. There are m seasons going from s_t to s_{t-m+1}, while the 1-lag seasonals go from s_{t-1} to s_{t-m}. As a result, the seasonal component of w is

    w_s = [ 0_{1,m-1}  1 ],

since the measurement equation depends on s_{t-m}. By similar reasoning,

    F_s = [ 0_{1,m-1}  1         ]
          [ I_{m-1}    0_{m-1,1} ]

and g_s = [γ, 0_{1,m-1}]′. If m is the number of seasons, q is the number of non-seasonal states, and ns denotes the non-seasonal components, then the full matrix F is given by

    F = [ F_ns     0_{q,m} ]
        [ 0_{m,q}  F_s     ].

Other full vectors are made by putting the seasonal components below the non-seasonal components.

There are other conceivable additive models than those listed above. One such model is a local level model with a constant gradient term, in which we would set β = 0 in the local trend model. A quadratic term could also be introduced in the exponential smoothing equations. However, these other models have not been effective at forecasting (Gardner 1985), and so we will use the following six additive models in this thesis:

1. Local Level Model

2. Local Trend Model
3. Damped Trend Model
4. Seasonal with non-seasonal local level
5. Seasonal with non-seasonal local trend
6. Seasonal with non-seasonal damped trend

The first three models listed here can be applied to all time series; the last three can only be applied to time series with a seasonal pattern, such as quarterly or monthly data. The measurement and transition components of (2.5.1) and (2.5.3) are illustrated in Table 2.1; the transition equation can span multiple lines. Multiplicative variations of these models are used in some time series applications; in these, some or all of the b_t, l_t and e_t can be multiplied together instead of added together. The next chapter will provide more detail on multiplicative models.
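To make the matrix formulation of Section 2.5 concrete, the sketch below builds w, F and g for the damped trend model and its seasonal counterpart, and runs the recursion (2.5.1)-(2.5.3) over a series. It is an illustrative Python/NumPy sketch, not the Matlab code used for this thesis, and the parameter values and seed states are made up.

    import numpy as np

    def damped_trend_matrices(alpha, beta, phi):
        # states x_t = [l_t, b_t]'
        w = np.array([1.0, phi])
        F = np.array([[1.0, phi],
                      [0.0, phi]])
        g = np.array([alpha, beta])
        return w, F, g

    def seasonal_damped_trend_matrices(alpha, beta, gamma, phi, m):
        # states x_t = [l_t, b_t, s_t, s_{t-1}, ..., s_{t-m+1}]'
        w_ns, F_ns, g_ns = damped_trend_matrices(alpha, beta, phi)
        w_s = np.r_[np.zeros(m - 1), 1.0]        # measurement uses s_{t-m}
        F_s = np.zeros((m, m))
        F_s[0, m - 1] = 1.0                      # s_t = s_{t-m} + gamma * e_t
        F_s[1:, :-1] = np.eye(m - 1)             # shift the stored seasonal states
        g_s = np.r_[gamma, np.zeros(m - 1)]
        w = np.r_[w_ns, w_s]
        F = np.block([[F_ns, np.zeros((2, m))],
                      [np.zeros((m, 2)), F_s]])
        g = np.r_[g_ns, g_s]
        return w, F, g

    def innovations_filter(y, w, F, g, x0):
        """Run (2.5.1)-(2.5.3): mu_t = w x_{t-1}; e_t = y_t - mu_t; x_t = F x_{t-1} + g e_t."""
        x = x0.astype(float)
        errors = np.empty(len(y))
        for t, obs in enumerate(y):
            mu = w @ x
            errors[t] = obs - mu
            x = F @ x + g * errors[t]
        return errors, x

    # quarterly illustration (m = 4); smoothing parameters and seed states are arbitrary
    w, F, g = seasonal_damped_trend_matrices(alpha=0.3, beta=0.1, gamma=0.2, phi=0.95, m=4)
    x0 = np.r_[100.0, 1.0, -1.5, 0.5, -1.0, 2.0]   # [l_0, b_0, seasonal seeds summing to zero]
    y = 100 + np.arange(12) + np.tile([2.0, -1.0, 0.5, -1.5], 3)
    errors, x_final = innovations_filter(y, w, F, g, x0)

The same filter covers all six additive models in Table 2.1 once the appropriate w, F, g and seed state vector x_0 are supplied; the local level model, for example, reduces to w = F = 1 and g = α.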

Tab. 2.1: Transition and Measurement Equations for Six Additive Models

Local Level
  Non-Seasonal:  e_t = y_t - l_{t-1};  l_t = l_{t-1} + αe_t
  Seasonal:      e_t = y_t - (l_{t-1} + s_{t-m});  l_t = l_{t-1} + αe_t;  s_t = s_{t-m} + γe_t

Local Trend
  Non-Seasonal:  e_t = y_t - (l_{t-1} + b_{t-1});  l_t = l_{t-1} + b_{t-1} + αe_t;  b_t = b_{t-1} + βe_t
  Seasonal:      e_t = y_t - (l_{t-1} + b_{t-1} + s_{t-m});  l_t = l_{t-1} + b_{t-1} + αe_t;  b_t = b_{t-1} + βe_t;  s_t = s_{t-m} + γe_t

Damped Trend
  Non-Seasonal:  e_t = y_t - (l_{t-1} + φb_{t-1});  l_t = l_{t-1} + φb_{t-1} + αe_t;  b_t = φb_{t-1} + βe_t
  Seasonal:      e_t = y_t - (l_{t-1} + φb_{t-1} + s_{t-m});  l_t = l_{t-1} + φb_{t-1} + αe_t;  b_t = φb_{t-1} + βe_t;  s_t = s_{t-m} + γe_t

2.6 Estimation

As in any model-fitting context, we seek to estimate the unknown parameters. In this case the main application of the parameter estimates is to produce forecasts. The general approach is to use maximum likelihood, with some minor variations as outlined below. Since a conditional likelihood approach will be used, the number of parameters, k, is not subtracted when estimating the residual variance of the model. That is, we use:

    σ̂^2 = (Σ_{t=1}^{n} ê_t^2) / n,                              (2.6.1)

rather than an estimate with n - k in the denominator.

For additive models, all parameters can be estimated so as to minimise the sum of squared errors (SSE). However, the SSE for multiplicative models and transforms cannot be compared with the SSE for additive models. The likelihood for these other models can be compared, and we introduce it here. If θ̂ is the vector of all smoothing and damping parameters, then we can say that ê_t = y_t - f(θ̂, x_0) for all t, with the ê_t assumed to be approximately N(0, σ^2). The likelihood can thus be calculated from the normal probability density function as:

    L = Π_{t=1}^{n} (1 / (σ̂ √(2π))) exp(-ê_t^2 / (2σ̂^2))
      = (1 / (σ̂ √(2π)))^n exp(-(1 / (2σ̂^2)) Σ ê_t^2)
      = (1 / (σ̂ √(2π)))^n exp(-n/2),

because Σ ê_t^2 = nσ̂^2. Taking the log of this likelihood gives the log likelihood, which is generally used for computational work. Thus

    log L = -n log(σ̂ √(2π)) - n/2
          = -(n/2)(2 log(σ̂) + log(2π) + 1).                      (2.6.2)

Equation (2.6.2) is the form of the log likelihood used in the six models previously described. The aim is to optimise all seed states and smoothing and damping parameters, subject to parameter restrictions, so as to maximise this log likelihood for the model selected.

2.7 Forecasting

So far we have looked at the estimation procedure for time series models, which applies for periods 1 to n. The forecast period is for periods n + 1 to n + h, where h is the number of periods required for forecasting. When applying Equations (2.5.1) and (2.5.3) to future values, the errors e_t become random variables, and it is assumed that e_t ~ N(0, σ^2). Equations (2.5.1) and (2.5.3) are now rearranged to get:

    y_t = w x_{t-1} + e_t;                                       (2.7.1)
    x_t = Fx_{t-1} + ge_t.                                       (2.7.2)

This formulation can be used to generate future predictions of y_t, which are used in prediction performance.

The expected value of the e_t is zero, and from this a simple point prediction algorithm can be obtained for t going from n + 1 to n + h by replacing e_t in Equations (2.7.1) and (2.7.2) with its expected value, zero:

    ŷ_t = w x_{t-1};
    x_t = Fx_{t-1}.                                              (2.7.3)

Here, ŷ represents the forecast. The final state vector x_n from the estimation sample is used to seed (2.7.3). If j = 1, 2, ..., h represents the forecast index, then, for the local level model, ŷ_{n+j} = l_n for all j, i.e., the point forecast is the same as the final level from the estimation period. For the local trend model, ŷ_{n+j} = l_n + b_n j, i.e., the trend at time n is followed for the forecast period. For the damped trend model, ŷ_{n+j} = l_n + (φ + φ^2 + ... + φ^j)b_n, i.e., the trend at time n becomes damped as j increases. For seasonal models, the seasonal pattern at time n is retained.

To calculate the estimated forecast variance, the following algorithm can be used for j = 1, 2, ..., h, where Var represents the estimated variance of the

bracketed expression:

    Var(y_{n+j}) = w Var(x_{n+j-1}) w′ + σ̂^2;
    Var(x_{n+j}) = F Var(x_{n+j-1}) F′ + σ̂^2 gg′.               (2.7.4)

This is seeded with Var(x_n) = a k × k zero matrix, where k is the number of states. It is derived by replacing e_t in Equations (2.7.1) and (2.7.2) with the error variance, σ̂^2, as calculated in (2.6.1). Equation (2.7.4) shows that the greater the smoothing parameters, the faster the future variance of the forecasts grows. This future variance can then be used for inference using a normal distribution in the usual way. For example, a 95% prediction interval for y_{n+j} can be expressed as ŷ_{n+j} ± 1.96 √Var(y_{n+j}).

The future variance formula for the local level model for time period n + j is:

    Var(y_{n+j}) = σ̂^2 [1 + α^2 (j - 1)].                       (2.7.5)

Future variance formulae for the other five additive models are given on page 82 of Hyndman et al. (2008a).
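Sections 2.6 and 2.7 can be combined into a short program for the simplest case. The sketch below (an illustrative Python sketch using NumPy and SciPy, not the thesis's Matlab implementation; the data are made up) estimates l_0 and α for the local level model by minimising the sum of squared one-step errors, and then produces point forecasts with 95% prediction intervals using the variance formula (2.7.5).

    import numpy as np
    from scipy.optimize import minimize

    def sse_local_level(params, y):
        """Sum of squared one-step errors for a given seed level l0 and smoothing parameter alpha."""
        l0, alpha = params
        level, sse = l0, 0.0
        for obs in y:
            e = obs - level
            sse += e * e
            level += alpha * e
        return sse

    def fit_and_forecast_local_level(y, h):
        n = len(y)
        res = minimize(sse_local_level, x0=np.array([y[0], 0.5]), args=(y,),
                       bounds=[(None, None), (0.0, 1.0)], method="L-BFGS-B")
        l0, alpha = res.x
        sigma2 = res.fun / n                       # conditional variance estimate (2.6.1)
        level = l0                                 # run the filter once more to obtain l_n
        for obs in y:
            level += alpha * (obs - level)
        j = np.arange(1, h + 1)
        point = np.full(h, level)                  # local level point forecasts are flat
        var = sigma2 * (1 + alpha ** 2 * (j - 1))  # forecast variance (2.7.5)
        half = 1.96 * np.sqrt(var)
        return point, point - half, point + half

    # illustrative series with six forecasts
    y = np.array([112.0, 118.0, 132.0, 129.0, 121.0, 135.0, 148.0, 148.0, 136.0, 119.0])
    point, lower, upper = fit_and_forecast_local_level(y, h=6)

Maximising (2.6.2) directly would give the same parameter estimates, since for a fixed number of observations the log likelihood is a decreasing function of the SSE.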

2.8 Model Selection

Suppose we have fitted all six models to a series, and we now want to know which model to actually select. If the true optimum of a model for a particular series can be found, then the log likelihood for a more general model, as found in Equation (2.6.2), must be equal to or greater than the log likelihood for a nested model. There are various information criteria that can be used for model selection, but the simple Akaike Information Criterion (AIC) (Akaike 1974) does well (Chapter 7 of Hyndman et al. 2008a). The AIC can also be used to compare non-nested models. If p_i represents the total number of free parameters for model i, including both seed state and smoothing and damping parameters, then the AIC for model i is defined as:

    AIC_i = -2 log L_i + 2p_i.                                   (2.8.1)

To select the best model, we look for the model that has the smallest AIC. Seasonal models have many more parameters, and so will be more heavily penalised by the AIC method. Note that seasonal models have one less free parameter than it first appears, because of the restriction that the sum of

the initial seasonal indices must be zero.

Gardner (2006) points out that in many studies, using the damped trend model, with appropriate seasonality, has worked better than selecting a model using a criterion such as the AIC. This suggests that using the non-seasonal damped trend model for non-seasonal series, and the seasonal damped trend model for seasonal series, will work well. If the damped trend model is always used, allowing the damping parameter φ to be zero makes sense, as this case of the damped trend is then equivalent to the local level model.

Applying the damped trend model to all time series is an example of aggregate selection, in which the same model or method is used for all time series (Fildes 1989). The AIC uses individual selection, because a different method can be chosen for each time series. In Chapter 3, both aggregate and individual model selection methods are used.

2.9 The M3-Competition Data

The effectiveness of statistical models can be assessed in a number of ways. Conventional approaches include theoretical comparisons, often based on

asymptotic results, and simulation. The area of forecasting is one which has embraced a more direct and relevant tradition, of comparing the performance of methods on large samples of actual series. While this poses questions of the representative nature of the samples, it is nonetheless an appealing exercise, because it examines performance for real (and therefore realistic) data. In order to assess which methods perform well, we need to compare them across a large selection of time series.

One of the first competitions that assessed different time series methods on a large number of time series was Newbold and Granger (1974), which used 106 time series, and assessed three methods and combinations of those methods. Makridakis and Hibon (1979) used a similar number of time series, but had 13 core methods, rather than just three (Fildes and Makridakis 1995). A big jump in the number of series analysed came in the M-competition study (Makridakis et al. 1982), which used 1001 series.

The M-competition and similar studies have been criticised on the grounds that they do not allow for manual overrides of the forecasting procedures (Newbold, in the Armstrong and Lusk 1983 commentaries). It has been

argued that in a realistic situation, business would adjust forecasts given prior knowledge of events likely to affect performance. The M2-competition (Makridakis et al. 1993) attempted to address these concerns by allowing competition participants to adjust their forecasts after one year, but there was little change in the ordering of methods.

The empirical studies have found that simpler time series methods have been better at forecasting than more complex methods (Fildes and Makridakis 1995). Methods such as the local level model, or even the benchmark random walk, or their seasonal equivalents, have outperformed more complex methods that do well when just the sample fit is assessed, but poorly on forecasting.

In the rest of this thesis, the M3-competition data (Makridakis and Hibon 2000) will be extensively used. The M3 data has 645 annual series, 756 quarterly series and 1428 monthly series. There were also 174 Other series, but these were not used for this thesis. These series are all real, and come from a variety of contexts, including industry, finance and demographic data. All of the M3 data have positive values. Many studies have used the M3 data to compare forecasting models or methods, including the original article;

Hyndman, Koehler, Snyder and Grose (2002); Taylor (2003); Hyndman and Billah (2003); Hyndman et al. (2008a, Chap 7); and Fildes and Petropoulos (2015).

The length of each series in the M3 data varies, but in the original article and subsequently, in typical evaluations, six observations have been withheld for annual data, eight for quarterly data and 18 for monthly data. For consistency with earlier studies, this thesis will also use these withheld data lengths. The withheld data can then be used to test how well forecasts actually perform using performance prediction measures such as MASE and MAPE. Withholding data and performance measures are described in the next section. The M3-competition data can be downloaded from the following website: (accessed 22 August 2014).

Table 2.2 gives the total number of series for each frequency, and some statistics on the fit lengths of that frequency; the fit length excludes the withheld sample.

Tab. 2.2: Fitting length statistics on the M3 data

              Nser    Min    Median    Mean    Max
Annual         645    ...      ...      ...    ...
Quarterly      756    ...      ...      ...    ...
Monthly       1428    ...      ...      ...    ...

2.10 Prediction Validation

In Section 2.7, we calculated forecasts for a model, but we did not have any actual data to measure these forecasts against. This problem can be solved using the method of prediction validation, in which we withhold a pre-set number of observations from the estimation process; these observations can then be used to test how well the predictions perform. For example, in a time series of 36 periods, we may use 30 periods for estimation (the fitting sample), and withhold observations from six periods for validation. In the following discussion, we define the forecasting errors as e_j = y_j - ŷ_j, where j is the index of the forecast, y_j is the actual observation, and ŷ_j is the point prediction.

Many methods have been proposed to examine prediction performance for point predictions. Measures based on the unadjusted forecasting errors are

clearly inappropriate for comparing multiple time series because they depend on the scaling of the units being measured (Hyndman and Koehler 2006). A commonly used adjustment is to convert these errors to absolute percentage errors (APEs) using p_j = 100 |e_j| / |y_j|. This creates problems when y_j is at or near zero. Should the mean or the median of the p_j be used? The M-competition used the mean, and this was criticised by Gardner in the Armstrong and Lusk (1983) commentaries because the median will not be affected by outliers. However, the median only takes account of 50% of the p_j (Davydenko and Fildes 2013). If a model performed well on just over half the withheld sample, but badly on the rest, the median would show that the model performed well, while the mean would not. Furthermore, forecasts are usually more accurate at short horizons than at long horizons, so using the median could bias the results towards models that perform better at short horizons.

The M3-competition used the symmetric APE, defined as sAPE_j = 100 |e_j| / ((|y_j| + |ŷ_j|)/2) = 200 |e_j| / (|y_j| + |ŷ_j|) (Makridakis 1993). The sAPE avoids problems caused by zero or near-zero y_j, and is supposedly symmetric because it gives the same result if y_j and ŷ_j are interchanged, while the APE does not.

However, the sAPE is flawed because a positive error will have a greater sAPE than a negative error of the same magnitude for the same y_j (Goodwin and Lawton 1999). For example, if y_j = 100 and ŷ_j = 50, then e_j = +50 and the sAPE is 67%. On the other hand, if y_j = 100 and ŷ_j = 150, then e_j = -50 and the sAPE is 40%. In both cases, the APE is 50%. Since negative predictions can occur even on positively-valued time series (see Chapter 6), the sAPE can also be affected by a zero problem.

Relative absolute errors (RAEs) are defined as RAE_j = |e_j| / |e*_j|, where e*_j is the error from a benchmark model, usually a random walk (Armstrong and Collopy 1992). Since e*_j can be very close to zero, this suggestion does not work well either (Hyndman and Koehler 2006). Because of rounding, even notionally continuous data can have two consecutive points that are the same, and under these circumstances the error for a random walk will be zero. Winsorizing can be used to trim extreme values, but this is an arbitrary method.

An alternative to the RAE is to divide a performance measure for a particular method by the same performance measure for a benchmark method (Armstrong and Collopy 1992). A result less than one means the method

in question is better than the benchmark, while a result greater than one means the reverse. This performance measure can also be affected by the zero problem in the same way as the RAE.

The Absolute Scaled Error (ASE) (Hyndman and Koehler 2006) is defined as:

    ASE_j = |e_j| / ( (1/(n-1)) Σ_{t=2}^{n} |y_t - y_{t-1}| ),   (2.10.1)

where n is the length of the fitting sample, and t the fitting sample index. Both the APE and ASE divide by a scaling factor, but the two scaling factors are very different. The APE can look very poor for a series if a future observation y_j is near zero. The ASE attempts to avoid this problem by averaging the absolute values of the one-step differences in the fitting sample, but we shall see in Chapter 8 that the scaling factor for the ASE is flawed. Chapter 8 extends Davydenko and Fildes (2013), who noted that the MASE would have a zero problem if the fitting sample was very flat.

All measures considered in this section are only useful for point forecasts, and give no information about the distribution. Distribution-based measures will be introduced later in this thesis.

In this thesis we have used the mean APE and mean ASE (MAPE and MASE) for point prediction measures, because the MAPE is commonly used (Fildes and Goodwin 2007), and the MASE is a recent innovation that was used in Hyndman et al. (2008a). The MAPE is defined as:

    MAPE = (100/h) Σ_{j=1}^{h} |e_j / y_j|.                      (2.10.2)

Both the MASE and MAPE should be used only to give model prediction performance statistics based on a large number of series; neither measure performs well in actually selecting the model for a particular series. This was shown in Chapter 7 of Hyndman et al. (2008a), where model selection methods based on prediction validation do poorly. Prediction validation performed well in Fildes and Petropoulos (2015). However, this study was based on the 998 M3 monthly series with a total length of 126 or more observations, and did not attempt prediction validation on shorter time series; as we can see from Table 2.2, most of the M3 series are much shorter than 126 total observations, including withheld data.
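The point prediction measures used here are simple to compute. The sketch below (illustrative Python with made-up numbers) evaluates the APE, the sAPE, the ASE of (2.10.1), the MAPE of (2.10.2) and the MASE over a small withheld sample; the first line of output reproduces the sAPE asymmetry discussed above (about 67% against 40% for errors of the same magnitude).

    import numpy as np

    def ape(y, f):
        return 100.0 * abs(y - f) / abs(y)

    def sape(y, f):
        # symmetric APE used in the M3-competition
        return 200.0 * abs(y - f) / (abs(y) + abs(f))

    def mape(actual, forecast):
        # (2.10.2): mean absolute percentage error over the withheld sample
        return 100.0 * np.mean(np.abs((actual - forecast) / actual))

    def mase(actual, forecast, fitting_sample):
        # mean of the ASE in (2.10.1); the scaling factor is the mean absolute
        # one-step difference over the fitting sample
        scale = np.mean(np.abs(np.diff(fitting_sample)))
        return np.mean(np.abs(actual - forecast)) / scale

    print(sape(100.0, 50.0), sape(100.0, 150.0))   # about 66.7 and 40.0
    print(ape(100.0, 50.0), ape(100.0, 150.0))     # 50.0 in both cases

    # withheld-sample evaluation with illustrative numbers
    fit = np.array([100.0, 104.0, 103.0, 108.0, 110.0, 115.0])   # fitting sample
    future = np.array([118.0, 121.0, 119.0])                      # withheld observations
    fc = np.array([116.0, 117.0, 118.0])                          # point forecasts
    print(mape(future, fc), mase(future, fc, fit))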

2.11 An Example

Figure 2.1 shows the US employment level in millions of persons from January 2000 to December 2011, as derived from the household survey conducted monthly by the US Bureau of Labor Statistics (BLS). Although the BLS applies seasonal adjustments, it is better for our purposes not to use seasonal adjustment, as we want to examine seasonal models, and so the data used has not been seasonally adjusted. This selection is 12 years of monthly data, so it has 144 total observations.

It is clear that there is a strong seasonal pattern in the data. We would also expect the employment level to grow over time, keeping up with population growth; this was true until about mid-2008, when the US economy began shedding jobs at a great rate as a result of the financial crisis. At the end of this data, the employment level was still a long way below where it would have been had the financial crisis not occurred.

In fitting this data, we first withhold the final 18 observations, so that the fitting sample goes from January 2000 to June 2010, and contains 126 observations. We now need to ask which model we should select using only the fitting sample. Using the estimation method described in Section 2.6, the maximum likelihood was

Fig. 2.1: US Employment Level, January 2000 to December 2011, not seasonally adjusted. Source: St Louis Fed, fred2/series/lnu /, retrieved 22 August 2014

found for all six models. Log likelihoods, numbers of free parameters and AICs are given in Table 2.3.

Tab. 2.3: Six Models for US employment data

Model                     logL    npar    AIC
Local level                ...     ...     ...
Local trend                ...     ...     ...
Damped trend               ...     ...     ...
Seasonal level             ...     ...     ...
Seasonal trend             ...     ...     ...
Seasonal damped trend      ...     ...     ...

Table 2.3 shows that the seasonal model with a damped local trend should be selected because it has the lowest AIC. Applying this model to the fitting sample gives an estimated standard deviation of about 370,000 persons, with α̂ = 0.66, β̂ = 0.40, γ̂ = ... and φ̂ = ...

If we now apply the forecasting methods in Section 2.7 to the 18 withheld periods, we get the graph shown in Figure 2.2. In this case, the point predictions are fairly close to the actual data, but the 90% prediction intervals become very wide, showing the considerable uncertainty of predicting a long way into the future.

Fig. 2.2: Employment Fits, Predictions and 90% PIs
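The selection step behind Table 2.3 is a one-line computation once each model's maximised log likelihood and free-parameter count are available. The sketch below is illustrative Python with placeholder log likelihoods (the actual values for the employment series are those in Table 2.3); it simply applies (2.8.1) and picks the smallest AIC. The parameter counts follow the free-parameter counting of Section 2.8 for monthly data, including the sum-to-zero constraint on the seasonal seeds.

    # (log likelihood, number of free parameters) for each fitted model;
    # the log likelihoods here are placeholders, not the values for the employment data
    fits = {
        "Local level":           (-310.0, 2),
        "Local trend":           (-305.0, 4),
        "Damped trend":          (-304.0, 5),
        "Seasonal level":        (-250.0, 14),
        "Seasonal trend":        (-240.0, 16),
        "Seasonal damped trend": (-238.0, 17),
    }

    aic = {name: -2.0 * loglik + 2 * p for name, (loglik, p) in fits.items()}   # (2.8.1)
    best = min(aic, key=aic.get)
    print(best, round(aic[best], 1))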

2.12 Other Time Series Methods

In this section, we briefly discuss the Autoregressive Integrated Moving Average (ARIMA) models (Box and Jenkins 1970), the Kalman filter (Kalman 1960), the Multiple Sources of Error (MSOE) state space approach (Duncan and Horn 1972; Harrison and Stevens 1976), and the Theta method (Assimakopoulos and Nikolopoulos 2000). Arguments are advanced for the advantages of the innovations state space approach in relation to the other methods.

2.12.1 ARIMA

ARIMA is also known as the Box-Jenkins approach. A general ARIMA(p, d, q) model can be defined as:

    φ(L)(1 - L)^d y_t = θ(L)e_t.                                 (2.12.1)

Here, L is the lag operator, and is such that L^i y_t = y_{t-i}. The first part of the left hand side of (2.12.1), φ(L), is a polynomial of degree p called the

autoregressive part, the second part is the number of differences required, and the right hand side of (2.12.1), θ(L), is a polynomial of degree q called the moving average part, so that

    θ(L)e_t = (θ_0 + θ_1 L + θ_2 L^2 + ... + θ_q L^q)e_t = θ_0 e_t + θ_1 e_{t-1} + θ_2 e_{t-2} + ... + θ_q e_{t-q}.

The φ_i are autoregressive parameters, and the θ_i are moving average parameters.

A local level model can be converted to an ARIMA(0,1,1) model, by first rewriting (2.2.1) as:

    y_t = l_{t-1} + e_t;
    (1 - L)l_t = αe_t.                                           (2.12.2)

Applying the operator (1 - L) to the first part of (2.12.2) gives:

    (1 - L)y_t = (1 - L)l_{t-1} + (1 - L)e_t
               = αe_{t-1} + e_t - e_{t-1}
               = (1 - θ_1 L)e_t.

This is now an ARIMA(0,1,1) model with θ_1 = 1 - α. Similarly, a local trend model is equivalent to an ARIMA(0,2,2) model, and a damped trend model is equivalent to an ARIMA(1,1,2) model (Gardner and McKenzie 1985). A numerical check of the ARIMA(0,1,1) equivalence is sketched at the end of this subsection. The ARIMA model formulation in (2.12.1) does not include seasonal components.

A seasonal ARIMA model can be formulated for m seasons using capital letters to represent the seasonal polynomials (see Ahmad, Khan and Parida 2001). In shorthand, it is referred to as an ARIMA(p, d, q)(P, D, Q)_m model, and takes the form:

    φ(L)Φ(L^m)(1 - L)^d (1 - L^m)^D y_t = θ(L)Θ(L^m)e_t.         (2.12.3)

The Φ and Θ components are lagged m periods, as are the seasonal differences.

All additive state space models are special cases of the ARIMA models, but multiplicative state space models are not special cases of ARIMA (Gardner 2006). When a series is differenced, the new series will have one fewer observation than the original series, or m fewer observations for seasonal differences. As a result, information is lost, and a log likelihood based on n - 1 observations cannot be easily compared with a log likelihood based on n observations; this also means that AICs cannot be easily compared. No information is lost for the state space models, so the state space formulation is preferable. More information on the ARIMA models can be found in Box, Jenkins and Reinsel (1994).
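The ARIMA(0,1,1) equivalence derived above can be checked numerically. The sketch below (illustrative Python, not from the thesis) computes one-step forecasts from the local level recursion with smoothing parameter α, and from the ARIMA(0,1,1) recursion with θ_1 = 1 - α, starting both from the same seed forecast; the two sets of forecasts agree to machine precision.

    import numpy as np

    rng = np.random.default_rng(0)
    y = 100 + np.cumsum(rng.normal(size=50))   # an arbitrary illustrative series

    def ses_forecasts(y, alpha, f1):
        # local level / simple exponential smoothing: f_{t+1} = f_t + alpha * (y_t - f_t)
        f, out = f1, []
        for obs in y:
            out.append(f)
            f = f + alpha * (obs - f)
        return np.array(out)

    def arima011_forecasts(y, theta1, f1):
        # (1 - L) y_t = (1 - theta1 L) e_t, so f_{t+1} = y_t - theta1 * e_t with e_t = y_t - f_t
        f, out = f1, []
        for obs in y:
            out.append(f)
            f = obs - theta1 * (obs - f)
        return np.array(out)

    alpha = 0.4
    print(np.allclose(ses_forecasts(y, alpha, f1=y[0]),
                      arima011_forecasts(y, 1.0 - alpha, f1=y[0])))   # True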

Kalman Filter

The Kalman filter uses a matrix decomposition method to effectively perform exponential smoothing, but with a random seed state. However, this filter method only performs well on stationary additive models, and none of the models used in this thesis are stationary. An example of a stationary model would be a damped level model, in which l_{t−1} in (2.2.1) is replaced with φl_{t−1}, 0 < φ < 1. The information filter presented in Hyndman et al., Chap 12 (2008a) is similar to the Kalman filter, but can cope with nonstationary models. However, neither filter can be easily applied to multiplicative models. The use of either filter requires a type of Gaussian elimination known as fast Givens transformations (Stirling 1981). In standard Gaussian elimination, we solve the matrix equation Ax = b by converting the matrix A into a unit upper triangular matrix. For the fast Givens transformations, the vector b becomes augmented with another vector representing the variance, and we solve the augmented matrix equation Ax = [µ v] for both mean and variance of x. Operations involving µ are as usual for Gaussian elimination, but operations involving v use rules for the addition and multiplication of

variances. A two equation representation of a fast Givens transformation is:

    x_1    x_2   |  µ      v
    a_11   a_12  |  µ_1    v_1
    a_21   a_22  |  µ_2    v_2

Use of Gaussian elimination on the elements of A will allow us to solve for both the mean and variance of x. Both filters use a recursive process to process observations, with both a prediction and a revision step. For the information filter, non-stationary time series models can be initialised with an infinite variance (no prior information), but the Kalman filter does not allow these initial settings for non-stationary time series models, and is thus restricted to stationary time series models. More information on the Kalman and information filters is presented in Chapter 12 of Hyndman et al. (2008a).

Multiple Sources of Error (MSOE) Models

This thesis uses innovations models that are also known as Single Source of Error (SSOE) models, in which the transition equation uses ge_t, where g is

a vector of smoothing parameters, and e_t is the only error source. The state space form for Multiple Sources of Error (MSOE) models is:

    y_t = w'x_{t−1} + e_t;
    x_t = Fx_{t−1} + η_t;
    (e_t, η_t')' ~ NID( (0, 0')', [ V_e  V_eη ; V_ηe  V_η ] ).    (2.12.4)

Here, η_t is a vector of normally distributed components. MSOE and SSOE forms of common additive models are given on page 211 of Hyndman et al. (2008a). On the surface, the MSOE model appears to be more general than the SSOE model, because the random error components should allow for more combinations than the vector components of the SSOE framework. However, Chapter 13 of Hyndman et al. (2008a) shows that in fact the opposite is true, and that the MSOE forms are nested within the SSOE forms, as the SSOE forms allow for a greater parameter space. The proof of this is that any MSOE or SSOE model can be expressed in ARIMA form. Any ARIMA model can be represented as a SSOE model, but not all ARIMA models can be expressed as MSOE models. Because the SSOE framework is much

simpler to work with than the MSOE framework, it will be used in this thesis. The MSOE framework also cannot be applied easily to multiplicative models.

The Theta Method

The Theta method for time series was proposed by Assimakopoulos and Nikolopoulos (2000), and it performed well on the M3-competition data. The original paper had complex algebra, but in a paper titled "Unmasking the Theta method", Hyndman and Billah (2003) show that the Theta method is, with appropriate seasonal adjustments, effectively a local level model with an additional constant trend term; this can be achieved by setting β = 0 in the local trend model. As a result, the Theta method is a special case of exponential smoothing. For a recent discussion of the Theta method and extensions of it, see Fioruci, Pellegrini, Louzada and Petropoulos (date unknown). In the Theta method, the constant trend is half the slope given by simple linear regression on the time series. This half term is the result of averaging two Theta-lines in the original paper; θ = 0 is equivalent to simple linear regression, and θ = 2 is equivalent to twice the curvature of the series. In this

context, the curvature is mathematically equivalent to the fitted values from a local level model.
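A minimal sketch of this interpretation of the Theta method is given below: simple exponential smoothing of the (assumed already seasonally adjusted) series, plus a drift equal to half the least-squares slope. The smoothing parameter value is a placeholder rather than a maximum likelihood estimate.

```python
import numpy as np

def theta_forecast(y, h, alpha=0.3):
    """Local level (SES) forecast plus a constant trend of half the OLS slope."""
    slope = np.polyfit(np.arange(len(y)), y, 1)[0]   # simple linear regression slope
    level = y[0]
    for obs in y[1:]:
        level += alpha * (obs - level)               # SES recursion for the level
    return level + 0.5 * slope * np.arange(1, h + 1)

print(theta_forecast(np.array([112.0, 118.0, 132.0, 129.0, 121.0, 135.0]), h=4))
```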

73 3. AN EVALUATION OF THE PERFORMANCE OF EXPONENTIAL SMOOTHING METHODS 3.1 Introduction The models used for exponential smoothing include the additive models described in Chapter 2. However, sometimes the additive models are too restrictive, and we would like to multiply two components, rather than add them. A method that used multiplicative seasonality was first proposed by Winters (1960); this is called the Holt-Winters multiplicative seasonality method. Methods that used both multiplicative trend and seasonality were introduced by Pegels (1969). A method that utilised the AIC to select from 24 models, including multiplicative models, was used in Hyndman et al. (2002); this method compared well with other methods. Taylor (2003) proposed the multiplicative damped trend method, which slightly outperformed

the additive damped method for the monthly M3 data. Historically, multiplicative exponential smoothing methods have not been concerned with the error term, because, given the same seed states and smoothing and damping parameters, the point forecasts are the same for additive and multiplicative error models. Including seasonality, there are a total of 30 Error, Trend, Seasonal (ETS) models (Hyndman et al., 2008a, Chap 2). Each letter in the ETS classification system corresponds to that particular model's error, trend or seasonality, so that any of the 30 models can be described using three letters and a possible subscript used to denote damping. The codes are:

E (Error): A for additive error, or M for multiplicative error.
T (Trend): N for no trend, A for additive trend, A_d for damped additive trend, M for multiplicative trend, or M_d for damped multiplicative trend.
S (Seasonality): N for no seasonality, A for additive seasonality, or M for multiplicative seasonality.

A local level model with additive errors is an ETS(A,N,N) model. A model

with multiplicative trend, additive seasonality and multiplicative errors would be an ETS(M,M,A) model. A model with additive damped trend, multiplicative seasonality and additive errors would be an ETS(A,A_d,M) model. The results of applying the 30 ETS models to the M3-competition data (Makridakis and Hibon 2000) are given in Chapter 7 of Hyndman et al. (2008a). However, that study used only a fixed origin method, and no account was taken of measures based on probability distributions; only point predictions were used. This chapter presents more information by the use of distribution-based measures, uses a rolling origin method, and summarises the results, with ANOVA used to test for statistically significant differences between the various models. We want to see if distribution-based methods give different results from point prediction methods. Section 3.2 shows how parameters of the various models are estimated, and gives methods for model selection. Section 3.3 describes the distribution-based future log likelihood. The rolling origin method is described in Section 3.4. The experimental design and results are presented in Sections 3.5 and 3.6. Finally, conclusions are made in Section 3.7.

3.2 Estimation and Model Selection

The eight cases of the damped trend model that involve seasonality are given in Table 3.1. The l_t, b_t and s_t are respectively the level, trend and seasonal states, and α, β and γ are the associated smoothing parameters, with φ the damping factor. µ_t is the one-step ahead prediction. Table 3.1 shows the eight most general cases of the ETS models, and simpler models can be easily derived from the general cases, as shown in Table 2.1. The measurement and transition equations of all 30 ETS models are tabulated on pages 21 and 22 of Hyndman et al. (2008a). Additive error models use e_t = y_t − µ_t, but multiplicative error models use relative errors defined as ε_t = (y_t − µ_t)/µ_t = e_t/µ_t. As a consequence, if both y_t and µ_t are positive, the range of ε_t will be (−1, ∞).

Tab. 3.1: Various Forms of the Damped Trend Model (A = Additive, M = Multiplicative)

Additive error (A), damped additive trend (A_d), additive seasonality (A):
    µ_t = l_{t−1} + φb_{t−1} + s_{t−m}
    l_t = l_{t−1} + φb_{t−1} + αe_t
    b_t = φb_{t−1} + βe_t
    s_t = s_{t−m} + γe_t

Additive error (A), damped additive trend (A_d), multiplicative seasonality (M):
    µ_t = (l_{t−1} + φb_{t−1})s_{t−m}
    l_t = l_{t−1} + φb_{t−1} + αe_t/s_{t−m}
    b_t = φb_{t−1} + βe_t/s_{t−m}
    s_t = s_{t−m} + γe_t/(l_{t−1} + φb_{t−1})

Additive error (A), damped multiplicative trend (M_d), additive seasonality (A):
    µ_t = l_{t−1}b_{t−1}^φ + s_{t−m}
    l_t = l_{t−1}b_{t−1}^φ + αe_t
    b_t = b_{t−1}^φ + βe_t/l_{t−1}
    s_t = s_{t−m} + γe_t

Additive error (A), damped multiplicative trend (M_d), multiplicative seasonality (M):
    µ_t = l_{t−1}b_{t−1}^φ s_{t−m}
    l_t = l_{t−1}b_{t−1}^φ + αe_t/s_{t−m}
    b_t = b_{t−1}^φ + βe_t/(s_{t−m}l_{t−1})
    s_t = s_{t−m} + γe_t/(l_{t−1}b_{t−1}^φ)

Multiplicative error (M), damped additive trend (A_d), additive seasonality (A):
    µ_t = l_{t−1} + φb_{t−1} + s_{t−m}
    l_t = l_{t−1} + φb_{t−1} + αµ_t ε_t
    b_t = φb_{t−1} + βµ_t ε_t
    s_t = s_{t−m} + γµ_t ε_t

Multiplicative error (M), damped additive trend (A_d), multiplicative seasonality (M):
    µ_t = (l_{t−1} + φb_{t−1})s_{t−m}
    l_t = (l_{t−1} + φb_{t−1})(1 + αε_t)
    b_t = φb_{t−1} + β(l_{t−1} + φb_{t−1})ε_t
    s_t = s_{t−m}(1 + γε_t)

Multiplicative error (M), damped multiplicative trend (M_d), additive seasonality (A):
    µ_t = l_{t−1}b_{t−1}^φ + s_{t−m}
    l_t = l_{t−1}b_{t−1}^φ + αµ_t ε_t
    b_t = b_{t−1}^φ + βµ_t ε_t/l_{t−1}
    s_t = s_{t−m} + γµ_t ε_t

Multiplicative error (M), damped multiplicative trend (M_d), multiplicative seasonality (M):
    µ_t = l_{t−1}b_{t−1}^φ s_{t−m}
    l_t = l_{t−1}b_{t−1}^φ(1 + αε_t)
    b_t = b_{t−1}^φ(1 + βε_t)
    s_t = s_{t−m}(1 + γε_t)

The traditional restrictions were used for the smoothing and damping parameters. These restrictions are: 0 ≤ α ≤ 1, 0 ≤ β ≤ α, 0 ≤ γ ≤ 1 − α and 0 < φ ≤ 1. Putting φ = 1 means that the damped trend model becomes equivalent to a local trend model. For additive seasonality, the initial m seasonal indices must sum to zero; for multiplicative seasonality they must sum to m, and must all be greater than zero; the sum to m restriction is so that the mean of the initial seasonal indices is one. For multiplicative trend models, the initial trend term must be greater than zero. In multiplicative trend and seasonal models, the b_t and s_t are multipliers of the level. From Table 3.1, when b_{t−1}^φ s_{t−m} = 1, the new one-step ahead prediction will be the same as the current level. If b_{t−1}^φ s_{t−m} > 1, the new prediction will be larger than the current level, and if b_{t−1}^φ s_{t−m} < 1, the new prediction will be smaller. In general, multiplicative models are not recommended for time series that have negative or zero observations because the iteration procedure can then become numerically unstable. However, all of the M3 data are positive. We seek estimates of all smoothing and damping parameters and seed states that maximise the log likelihood, subject to parameter restrictions. As ex-

plained in Chapter 2, for additive error models, the sample log likelihood can be calculated given the estimated standard deviation of the errors, namely σ̂, and is given by:

    log L = −(n/2)(1 + log(2π) + 2 log(σ̂)).    (3.2.1)

For multiplicative error models, as explained in the next chapter, we calculate the estimated standard deviation of the relative errors, s, then add an adjustment term that includes µ_t to Equation (3.2.1) to get:

    log L = −(n/2)(1 + log(2π) + 2 log(s)) − Σ log|µ_t|.    (3.2.2)

Additive error models are homoscedastic; this means that the one-step ahead variance does not depend on time. Multiplicative error models, on the other hand, are heteroscedastic (Greene, 2003, Chap 11), and the variance of a multiplicative error model depends on the fitted value at time t, such that Var_t = s²µ_t². A large µ_t will thus produce a large variance for multiplicative error models.
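A hedged sketch of these two likelihood calculations is shown below; the function and argument names are illustrative, with σ̂ and s taken as the root mean square of the fitted errors and relative errors respectively.

```python
import numpy as np

def loglik_additive(e):
    """Sample log likelihood (3.2.1) from the one-step fitted errors e_t."""
    n = len(e)
    sigma_hat = np.sqrt(np.mean(np.square(e)))
    return -0.5 * n * (1.0 + np.log(2 * np.pi) + 2.0 * np.log(sigma_hat))

def loglik_multiplicative(eps, mu):
    """Sample log likelihood (3.2.2) from relative errors eps_t and predictions mu_t."""
    n = len(eps)
    s = np.sqrt(np.mean(np.square(eps)))
    return (-0.5 * n * (1.0 + np.log(2 * np.pi) + 2.0 * np.log(s))
            - np.sum(np.log(np.abs(mu))))
```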

80 3. An Evaluation of the Performance of Exponential Smoothing Methods 80 Point forecasts for all models can be generated by assuming that e t = 0 for additive error models and ε t = 0 for multiplicative error models in the formulae in Table 3.1, and then iteratively applying these formulae until all desired predictions are filled. The µ t in the table then become the ŷ n+j used for point forecasts. We applied the Akaike Information Criterion (AIC) (Akaike 1974) in two different ways to select models. First, we selected using only the additive models (those without an M in the ETS formulation); this AIC calculation was known as AIC-add. Second, we selected using the AIC from all relevant ETS models (10 for annual data and 30 for monthly and quarterly data); this AIC is known as AIC-all. The two AIC methods were added to the models, giving a total of 12 models including AIC methods for annual data and 32 for quarterly and monthly data, as 20 seasonal models cannot be applied to non-seasonal data. In Chapter 7 of Hyndman et al. (2008a), many model selection methods were tested, and the AIC selection method performed well, so its use is justified. In this chapter, we want to see if model selection using all models is better than selecting using only the additive models; the Hyndman study found

that selecting using all models was generally a little worse than selecting using only additive models. The AIC methods select the model that gives the lowest AIC on the fitting sample, and this model will not necessarily be the best at forecasting future time periods.

3.3 Future log Likelihood

Point predictions are useful if only the mean is required. However, often measures of distribution will be needed, so that the probability of getting a particular observation can be ascertained. Here, the future log likelihood is introduced. The future log likelihood is known as the logarithmic score (Gneiting & Raftery 2007) and was applied for count time series in an exponential smoothing context in Snyder, Ord and Beaumont (2012); it can also be used for continuous series. The future log likelihood is defined for additive error models for horizon 1 as:

    futlogL = −0.5(log(2π) + 2 log(σ̂) + e²/σ̂²),    (3.3.1)

where σ̂ is the estimated standard deviation of the errors up to time point n, and e is the point prediction error of the one-step ahead forecast. The future log likelihood expression for multiplicative error models is:

    futlogL = −0.5(log(2π) + 2 log(s) + ε²/s²) − log|µ|,    (3.3.2)

where µ is the one-step ahead prediction, and ε is the relative point prediction error of the one-step ahead forecast. The future log likelihood calculates the log density of an actual future observation, given what we know from the fitting sample. Note that with the MAPE and MASE measures, smaller is better, but with the future log likelihood, larger is better. In estimating the future log likelihood, it is necessary to obtain the density of y_{n+1} given information up to time n. Then the density of y_{n+2} given information up to time n + 1 can be estimated, and this procedure can be followed until the end of the withheld data. These results can then be averaged to give the overall future log L for a series. This brings us to the rolling origin method, discussed in the next section. Analytic methods of calculating forecast variance for lead times of more than one do not exist for some models with multiplicative components (Hyndman et al., 2008a,

Chap 6), so it would be very difficult to calculate the future log L for these models for more than one step ahead; this is why the future log L is limited to one-step ahead forecasts throughout this thesis. We aim to maximise the future log likelihood, while minimising the MASE and MAPE. To avoid confusion in tables, we will use the negative of the future log likelihood to ensure that smaller is better for all measures; this will be known as the MLPL (minus log prediction likelihood).

3.4 The Rolling Origin Method

The concept of the rolling origin (see Tashman, 2000; Snyder et al., 2012) can be explained using Table 3.2. For illustrative purposes, a prediction horizon of six is assumed. The cells marked with an X contain numbers that depend on the forecast criterion being calculated. Only the states are revised as we roll forward into the future; the damping and smoothing parameter estimates remain unchanged from those estimated by optimising over the original n periods. This method is useful because the MLPL cannot be applied for more than one period ahead as the MLPL is based on the one-

step ahead forecast error. Also, as argued by Tashman (2000), the rolling origin method is better than the fixed origin method because there are more points of comparison for each forecast horizon.

Tab. 3.2: Rolling Origin Illustration

                        Lead Time
Prediction Origin    1    2    3    4    5    6
n                    X    X    X    X    X    X
n + 1                X    X    X    X    X
n + 2                X    X    X    X
n + 3                X    X    X
n + 4                X    X
n + 5                X
Means                µ_1  µ_2  µ_3  µ_4  µ_5  µ_6

Here is a step by step guide to how the rolling origin method works. x_n represents the vector of states at time n.

1. We use x_n to forecast h periods ahead.
2. Now, we record the actual value of y_{n+1}.
3. Without re-estimating smoothing and damping parameters, we use y_{n+1} to calculate x_{n+1}.
4. x_{n+1} can now be used to forecast h − 1 periods ahead.

5. Repeat this process until no future observations are left.

Note that it is a feature of this process that the forecasts rely progressively on the observed data, as it unfolds, but that the parameter estimates used only rely on the observations up to time n. This means that the one-step ahead forecasts will tend to become slightly less accurate as we move the origin away from n. Re-estimating all parameters at each step taken into the future would avoid this problem, but would require an excessive amount of computer time. Re-estimation would be feasible if we were only dealing with a few series, but we have 2829 total series here. Averages of particular lead times can now be calculated. For example, the 3-step ahead ASE will be the average of the 3-step ahead lead times for the 4 origins from n to n + 3. To get the overall result for a particular series, we calculate the average of horizon 1, average of horizon 2, etc. Then the overall result is an average of these averages. For the MLPL, only the first column in Table 3.2 is used, as the MLPL is a one-step ahead measure.
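The rolling origin procedure can be summarised in code form as below; the `forecast` and `update` methods are hypothetical stand-ins for a fitted state space model whose smoothing and damping parameters stay fixed while only the states are revised.

```python
import numpy as np

def rolling_origin_lead_time_errors(model, withheld, h_max):
    """Collect forecast errors by lead time, rolling the origin through `withheld`."""
    errors = [[] for _ in range(h_max)]              # one list of errors per lead time
    for origin in range(len(withheld)):
        h = min(h_max, len(withheld) - origin)
        preds = model.forecast(h)                    # forecasts from the current states
        for j in range(h):
            errors[j].append(withheld[origin + j] - preds[j])
        model.update(withheld[origin])               # revise states only; parameters fixed
    # average within each lead time, then average those averages for the series
    per_lead = [np.mean(np.abs(e)) for e in errors if e]
    return per_lead, float(np.mean(per_lead))
```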

MASE

The cells of Table 3.2 contain Xs that can represent the scaled absolute errors. The denominator of the MASE (see (2.10.1)) changes as the prediction origin is rolled forward. Thus, as we move the origin from period n to period n + 1 say, the MASE denominator is amended to reflect the effect of the new sample value y_{n+1}.

3.5 Experimental Design

The M3 data set has 645 annual series, 756 quarterly series and 1428 monthly series. Six observations are withheld for annual series, eight for quarterly and eighteen for monthly. Ten models could be applied to annual data, and 30 to quarterly and monthly data. In addition, two model selection methods were used: the AIC-add method selects using the AIC from only additive models, and the AIC-all method selects using all models. The following procedure was then used, with N the number of series in a particular frequency, and k the number of applicable models and methods (12 for annual data and 32 for seasonal data, including the AIC methods).

87 3. An Evaluation of the Performance of Exponential Smoothing Methods For each series, apply each of the 30 models (ten for annual data), and use maximum likelihood estimation to optimise all estimated parameter values, including seed states. 2. Use the rolling origin method to get results for each model for each future horizon period for the ASE and APE; for the MLPL, only the first horizon was used. 3. Average over each lead time to get one prediction measure result for each lead time for each series, as illustrated in Table Select the model that minimises the AIC on the fitting sample, both using only additive models and all models. AIC-add and AIC-all can now be added to the database. 5. The overall prediction measure for a series will then be an average of the prediction measures obtained in steps 3 and 4. This means that we now have an N k data table for each prediction measure. 6. A two-way ANOVA, with series as the row effect and method as the column effect, can be used to test for statistical differences between the various methods. For inference using ANOVA, we take the logs of the overall prediction measures in the N k table above. This is because

88 3. An Evaluation of the Performance of Exponential Smoothing Methods 88 the distribution of the original data s residuals is very non-normal, and this becomes much closer to normal when the data is logged. We cannot test for interaction between series and method using ANOVA, as we only have one entry per cell for each method and series, and interaction would thus reduce our error degrees of freedom to zero. The original M3-competition is described in Chapter 2. In that study, the performance of methods are compared using, among other measures, the symmetric MAPE. The methods used included the local level model, local trend model and damped trend model that are described in Chapter 2. However, these comparisons did not entail an evaluation of all 30 ETS models for the M3 data. The only previous work directly relevant to the comparisons made in this chapter is that found in Chapter 7 of Hyndman et al. (2008a). While we do not show the non-logged tables here, they can be found in the Appendix in Tables A.1 to A.3. In all cases, these tables show that the AIC-add method performs better than AIC-all. In comparing our results to the results in Hyndman et al. (2008a), which used a fixed origin method, we find that the ordering of models are roughly the same for the MASE and MAPE measures, but most MASEs and MAPEs given are slightly worse

89 3. An Evaluation of the Performance of Exponential Smoothing Methods 89 than those in the Hyndman et al. study. This is because we are not reestimating parameters at each iteration of the rolling origin, as was described in Section 3.4. Using ANOVA is justified because, while the M3 data is not a random sample from a population of time series, it does represent a large selection of real time series across a broad spectrum of types and contexts. Diagnostic graphs showed that the distribution of residuals from the logged data is approximately normal, though there are a few large positive outlier residuals owing to the difficulty of forecasting a few series. ANOVA has been previously used on the M3 quarterly data in the context of neural networks (Zhang and Kline 2007). Armstrong (2007) demonstrates that there are problems with statistical tests in the forecasting literature. There is, of course, a proper distinction to be made between practical importance of any difference, and the statistical significance of the corresponding test. In general terms, it is possible for a comparison to be statistically significant, but practically negligible, and also for the opposite: in a small study, a practically important difference may be found to be not statistically significant.

90 3. An Evaluation of the Performance of Exponential Smoothing Methods 90 The set of times series from the M3 competition used for the comparisons in this thesis cannot be considered small (with 645 annual series, 756 quarterly and 1428 monthly), so the conclusion is drawn that it is unlikely that materially important differences in the tests applied will be found to be statistically insignificant. Formally, the tests applied in this thesis cannot demonstrate definitively whether one method is genuinely a big improvement on another method, as opposed to just being statistically better. However, the tests show when there is not a significant difference, and, due to the substantial number of series used for analysis, it is unlikely that there would be a big improvement when two methods have no statistically significant difference. In this chapter, and for the thesis generally, we have focused on the average of the logs of performance measures. We could have used other measures, such as the number of series that a particular method performed best on (series wins measure), or the average rank of a method. The average rank measure orders each method for a series from 1 (best) to k (worst), then these ranks are averaged across the series.

91 3. An Evaluation of the Performance of Exponential Smoothing Methods 91 The problem with the series wins measure is that, as 32 methods are applied to seasonal data, a method which performs best on about 10% of series could seem to be the best method on seasonal data, even if that same method performed worst on 20% of series. Ranks are better than simply assessing the number of series wins, but do not account for the degree by which one method beats another on a particular series. Tables of mean ranks for annual, quarterly and monthly data have been included in the Appendix (Tables A.4 to A.6). 3.6 Results Tables 3.3 to 3.5 give grouping information based on pairwise testing between methods, adjusted for multiple comparisons using Tukey s method, after applying a two-way ANOVA. In each table, the mean of each method is given, along with a suffix letter or letters corresponding to that method s grouping. For example, all methods with a suffix of a are not statistically different from each other, at the 5% level for the family-wise error rate, and similarly for all suffixes. Methods that do not share a common letter are statistically different at the 5% level after adjustment for multiple compar-

isons. The methods for each prediction measure are sorted from highest to lowest mean, with lowest mean best. Summaries of the tabulated data for each frequency are given. In all data tables used in this section, and throughout the thesis generally, smaller values are better. As a result, the use of better or best simply means a lower or lowest value of a performance measure. All data tables in this section, and in later chapters, use rolling origin statistics.

Tab. 3.3: ETS Models with AIC for Annual Data

Log MASE               Log MAPE               Log MLPL
Model      Mean        Model      Mean        Model      Mean
M,M,N      0.840a      M,M,N      2.588a      A,M_d,N    2.275a
A,M,N      0.805ab     A,M,N      2.546ab     A,A_d,N    2.237ab
M,N,N      0.799ab     M,N,N      2.529abc    A,M,N      2.228ab
A,N,N      0.794ab     AIC-all    2.526abc    AIC-add    2.224ab
AIC-all    0.782abc    A,N,N      2.525abc    AIC-all    2.203bc
M,M_d,N    0.758abcd   M,M_d,N    2.504abcd   A,A,N      2.194bcd
A,M_d,N    0.751bcd    A,M_d,N    2.491bcd    M,A_d,N    2.156cde
M,A,N      0.737bcd    M,A,N      2.480bcd    M,M_d,N    2.143def
M,A_d,N    0.730bcd    M,A_d,N    2.472bcd    M,A,N      2.127efg
A,A,N      0.705cd     A,A,N      2.448cd     M,M,N      2.095fgh
AIC-add    0.687d      A,A_d,N    2.428d      A,N,N      2.079gh
A,A_d,N    0.687d      AIC-add    2.427d      M,N,N      2.061h

For annual data, the best methods in point terms using MASE are the additive trend with damping and additive errors model (ETS(A,A_d,N)) and AIC-add. These methods are significantly better than five other methods,

93 3. An Evaluation of the Performance of Exponential Smoothing Methods 93 which either have multiplicative trend without damping, no trend, or use all models for selection. Using the MAPE produces a similar result, but the MLPL produces a very different conclusion. For the MLPL, five of the top six models are multiplicative error models. The multiplicative error local level model, ETS(M,N,N), is significantly better than nine of the eleven other methods. The AIC methods are nowhere near the best methods for MLPL. For quarterly data, shown in Table 3.4, the AIC-add method is the best in point terms for both MAPE and MASE, with four models that have trend damping with seasonality coming close behind. These five methods all perform significantly better than many simpler models that have no damping or seasonality, though the AIC-all method is not significantly worse than AICadd for quarterly MASE and MAPE. However, for the MLPL, the four best methods are multiplicative error models, and the two best have seasonality but no trend. Once again, neither AIC method performs well for MLPL. For monthly data, shown in Table 3.5, the AIC-all method is best in point terms for the MAPE and MASE, with AIC-add and four models that have trend damping with seasonality also performing well. For MLPL, three models with no additive components perform best; neither AIC-all nor AIC-add

Tab. 3.4: ETS Models with AIC for Quarterly Data

Log MASE               Log MAPE               Log MLPL
Model      Mean        Model      Mean        Model      Mean
M,M,N      0.653a      M,M,N      1.978a      A,M_d,A    2.007a
M,N,N      0.629ab     A,M,N      1.943ab     A,M_d,M    2.005a
A,N,N      0.624abc    M,N,N      1.941ab     A,M_d,N    2.001ab
A,M,N      0.621abc    A,N,N      1.936ab     A,M,N      2.000ab
M,A,N      0.577abcd   M,A,N      1.899bc     A,M,A      1.999ab
M,M_d,N    0.568bcd    M,M_d,N    1.890bcd    A,M,M      1.994ab
A,M_d,N    0.563bcd    A,M_d,N    1.884bcd    A,A_d,M    1.993ab
A,A,N      0.548cde    A,A,N      1.868bcde   A,A_d,A    1.993ab
M,A_d,N    0.530de     M,A_d,N    1.850cde    A,A,A      1.988abc
A,A_d,N    0.527de     A,A_d,N    1.846cde    A,A_d,N    1.986abcd
A,N,A      0.517def    M,M,M      1.835cdef   A,A,M      1.986abcd
M,M,M      0.515def    M,M,A      1.834cdef   AIC-add    1.981bcde
M,N,A      0.514def    A,N,A      1.830cdefg  A,A,N      1.979bcdef
M,M,A      0.513def    M,N,A      1.827cdefg  AIC-all    1.978bcdef
M,N,M      0.51defg    M,N,M      1.824cdefg  A,N,N      1.970cdef
A,N,M      0.507defg   A,N,M      1.821defg   M,M_d,N    1.969cdef
A,M,A      0.500defgh  A,M,A      1.819defgh  M,M,N      1.967cdefg
A,M,M      0.481efghi  A,M,M      1.799efghi  M,M_d,A    1.966cdefg
M,A,A      0.448fghij  M,A,A      1.768fghij  M,A_d,A    1.964defg
A,M_d,A    0.444fghij  A,M_d,A    1.763fghij  M,M_d,M    1.962efg
M,A,M      0.443fghij  M,A,M      1.762fghij  M,M,A      1.961efg
M,M_d,A    0.435ghij   M,M_d,A    1.755ghij   M,A_d,N    1.961efg
A,A,A      0.434ghij   A,A,A      1.753ghij   M,N,N      1.961efg
A,M_d,M    0.426hij    A,M_d,M    1.742hij    M,A_d,M    1.960efg
AIC-all    0.420ij     AIC-all    1.739ij     M,A,M      1.960efg
A,A,M      0.414ij     M,M_d,M    1.733ij     M,A,N      1.960efg
M,M_d,M    0.413ij     A,A,M      1.731ij     A,N,M      1.959efg
M,A_d,A    0.390j      M,A_d,A    1.710j      A,N,A      1.958efg
A,A_d,M    0.389j      M,A_d,M    1.708j      M,A,A      1.958efg
M,A_d,M    0.389j      A,A_d,M    1.707j      M,M,M      1.957fg
A,A_d,A    0.388j      A,A_d,A    1.706j      M,N,M      1.946g
AIC-add    0.385j      AIC-add    1.703j      M,N,A      1.946g

Tab. 3.5: ETS Models with AIC for Monthly Data

Log MASE               Log MAPE               Log MLPL (6 models excluded)
Model      Mean        Model      Mean        Model      Mean
M,M,N      0.477a      M,M,N      2.376a      M,A_d,A    X
A,N,N      0.463ab     A,N,N      2.363ab     M,A,A      X
M,N,N      0.463ab     M,N,N      2.361ab     M,M,A      X
M,A,N      0.456abc    M,A,N      2.356abc    M,M_d,A    X
A,M,N      0.442abcd   M,A_d,N    2.337abcd   M,N,A      X
M,A_d,N    0.439abcd   A,M,N      2.331abcd   M,A,N      X
M,M_d,N    0.423bcde   M,M_d,N    2.322bcd    A,M_d,N    1.988a
A,A,N      0.416bcdef  A,A,N      2.308cde    A,M,N      1.987a
A,M_d,N    0.412cdefg  A,M_d,N    2.304def    A,N,N      1.986a
A,A_d,N    0.407defgh  A,A_d,N    2.300def    A,A_d,N    1.985ab
A,N,A      0.386efghi  A,N,A      2.271efg    A,A,N      1.984abc
M,N,A      0.375fghij  M,N,A      2.259fgh    A,A_d,M    1.984abc
A,M,A      0.365ghijk  M,N,M      2.250ghi    A,A,M      1.984abcd
M,N,M      0.365ghijk  A,M,A      2.250ghi    A,M_d,M    1.984abcd
A,N,M      0.364hijk   A,N,M      2.250ghi    A,M,M      1.983abcde
A,A,A      0.352ijkl   A,A,A      2.238ghij   A,N,M      1.981abcdef
M,M,A      0.342ijklm  M,M,A      2.224ghijk  A,M_d,A    1.981abcdef
M,M,M      0.338jklmn  A,M_d,A    2.223ghijk  A,A_d,A    1.981abcdef
A,M_d,A    0.337jklmn  M,M,M      2.222hijk   A,A,A      1.980abcdef
M,A,A      0.336jklmn  M,A,A      2.222hijk   A,M,A      1.980abcdef
M,M_d,A    0.333jklmn  M,M_d,A    2.216hijk   A,N,A      1.978bcdef
A,M,M      0.332jklmn  A,M,M      2.214hijk   M,A_d,N    1.978bcdef
A,A_d,A    0.326klmn   A,A_d,A    2.210hijk   M,N,N      1.977cdefg
A,A,M      0.322klmn   M,A_d,A    2.204ijk    M,M,N      1.976defgh
M,A_d,A    0.319klmn   A,A,M      2.200jk     M,M_d,N    1.975efghi
M,A,M      0.314lmn    M,A,M      2.196jk     AIC-add    1.973fghij
M,M_d,M    0.310lmn    M,M_d,M    2.194jk     M,A_d,M    1.970ghij
A,A_d,M    0.310lmn    AIC-add    2.192jk     M,A,M      1.968hij
A,M_d,M    0.305lmn    A,A_d,M    2.188k      AIC-all    1.968hij
M,A_d,M    0.304mn     A,M_d,M    2.188k      M,M,M      1.967ij
AIC-add    0.301mn     M,A_d,M    2.187k      M,M_d,M    1.967ij
AIC-all    0.294n      AIC-all    2.179k      M,N,M      1.967j

96 3. An Evaluation of the Performance of Exponential Smoothing Methods 96 are significantly different from the best models. For all data frequencies, on MASE and MAPE performance measures, the individual selection used by AIC-add was not significantly worse than any aggregate model, and it is the best overall method in point terms for quarterly MASE and MAPE and annual MAPE, and tied for best on annual MASE. On the MLPL, simple aggregate models do best, but these models are not recommended in general. They perform well on the MLPL measure for reasons discussed in Section 3.6.2, but are otherwise worse than more complex models. Problems with the MLPL can occur for multiplicative error models if the point prediction is very close to zero, because the variance of a multiplicative error model is proportional to the square of the point prediction. Thus, if µ t is near zero, then V t will also be near zero. If the actual observation is not near zero, the MLPL for that point will be very large. This problem occurred for a few monthly time series that had multiplicative error models fitted with either additive trend or seasonality. To reduce this problem, we excluded six models from the monthly MLPL ANOVA analysis. This problem can be resolved by putting a non-zero constant c into the multiplicative error

variance term. This type of model is investigated in the next chapter.

3.6.1 Spearman's ρ Correlations for the Performance Measures

Table 3.6 gives Spearman's ρ correlations for the three performance measures used in this chapter. If N is the total number of points, the Spearman's ρ statistic ranks both the x and y variables from 1 (smallest) to N (largest), and then applies Pearson correlation to the x and y ranks of each point. As it uses ranks, outliers have much less influence on this statistic than if the usual Pearson correlation was used. If the data is either monotonically increasing or decreasing, even in a nonlinear way, Spearman's ρ will give a perfect correlation (±1). The Pearson statistic only gives a perfect correlation if all of the data is on a straight line. Spearman's ρ is not affected by the scaling of the data, i.e., it does not matter whether the data is logged. While correlations between MAPE and MLPL are high in all cases, correlations between MASE and either MAPE or MLPL are weak for quarterly data, and negative for monthly data. The reasons for this will be explained

in a later chapter.

Tab. 3.6: Spearman's ρ̂ Correlations for the Performance Measures (correlations between MASE, MAPE and MLPL for annual, quarterly and monthly data; the numeric values were not preserved in this transcription)

3.6.2 Why MLPL and Point Prediction Measures Differ

We can see from the tables that, particularly for annual data, the best models using the MLPL measure differ markedly from the best models using the point prediction measures. One reason for this could be that the MLPL uses only one-step ahead forecasts, while point prediction measures use multi-step ahead forecasts. To address this issue, Table 3.7 recalculates the point prediction measures using one-step ahead forecasts.

Tab. 3.7: Annual Data with One-Step Ahead Point Prediction Measures. Note that logs are being used.

Log ASE 1 Step         Log APE 1 Step         Log MLPL
Model      Mean        Model      Mean        Model      Mean
AIC-all    0.042a      AIC-all    1.853a      A,M_d,N    2.275a
M,M_d,N    0.037ab     M,M_d,N    1.848a      A,A_d,N    2.237ab
M,N,N      0.032ab     M,N,N      1.824ab     A,M,N      2.228ab
A,N,N      0.019ab     M,A_d,N    1.823ab     AIC-add    2.224ab
M,A_d,N    0.016ab     M,M,N      1.819ab     AIC-all    2.203bc
M,A,N      0.005ab     A,N,N      1.813ab     A,A,N      2.194bcd
M,M,N      0.004ab     M,A,N      1.812ab     M,A_d,N    2.156cde
A,A_d,N    −0.014ab    A,A_d,N    1.792ab     M,M_d,N    2.143def
A,A,N      −0.019ab    A,A,N      1.789ab     M,A,N      2.127efg
A,M_d,N    −0.020ab    A,M,N      1.788ab     M,M,N      2.095fgh
A,M,N      −0.021ab    A,M_d,N    1.788ab     A,N,N      2.079gh
AIC-add    −0.029b     AIC-add    1.775b      M,N,N      2.061h

Compared to Table 3.3, there is less statistical difference between the various models as measured by the point prediction measures. Although model orderings are somewhat different, the AIC-add method and ETS(A,A_d,N) model perform well across all point prediction measures, yet are significantly worse than the two simplest models for the MLPL measure. Thus, MLPL measuring one-step ahead does not explain the discrepancy. To investigate further, we used two models: the ETS(A,N,N) and ETS(A,A_d,N) models. For each series, both models had their MLPL and one-step ahead ASE (ASE1) recorded. We then calculated d = log(A,N,N) − log(A,A_d,N)

for both ASE1 and MLPL (negative values imply that the ETS(A,N,N) model is doing better). Results are plotted in Figure 3.1, with ASE1 on the x axis and MLPL on the y axis. While the ASE1 results look symmetrical about zero, and range from about −3 to +3, the MLPL results range from −3 to +1. There are clearly many more large negative MLPLs than large positive MLPLs. Further analysis of the series with the lowest MLPL d (pointed out on the graph) shows that, for the series in question, there is a steep drop after the end of the fitting sample, as shown in Figure 3.2. Both models miss this drop, but ETS(A,A_d,N) misses by much more than ETS(A,N,N) because the optimum values of smoothing and damping parameters calculated using the fitting sample for the ETS(A,A_d,N) model make this model effectively behave like a simple regression. Furthermore, the variance of the fitted errors is much greater for ETS(A,N,N) than for ETS(A,A_d,N). As a result, there is a greater penalty for very poor predictions for the ETS(A,A_d,N) model than for the ETS(A,N,N) model; this explains why ETS(A,N,N) is significantly better than ETS(A,A_d,N) using the MLPL measure of future probability density. If a multiplicative error model gave the same forecasts as shown in Figure 3.2,

Fig. 3.1: d = log(A,N,N) − log(A,A_d,N) for both ASE1 and MLPL

Fig. 3.2: Observations and Predictions for Two Models for Annual Series 10

the MLPL would be reduced because the bigger forecasts lead to a greater variance, and thus the actual observation has a greater probability density. However, a multiplicative error model can be problematic when the forecasts are near zero. Figure 3.2 also shows that it is impractical to use simulation to calculate the MLPL for more than one step ahead when the distribution of the model cannot be analytically determined. Some series have future observations that are simply unpredictable given the fitting data. But since the probability of obtaining a realised future observation must be greater than zero, a simulation procedure that attempted to account for such unpredictable observations would require far too many individual simulations to be computationally practical.

3.7 Conclusions

Except for the annual and quarterly MLPL, the AIC-add method is not significantly worse than any other method, and is never significantly worse than the AIC-all method. The AIC-all method is well behind AIC-add on

point prediction measures except for monthly data, and is significantly worse than AIC-add on annual MASE and MAPE. As a result, AIC-add should be preferred to AIC-all. In general, more complex models with trend damping and seasonality, where applicable, do better on point prediction measures, but simpler models with multiplicative error components do best on MLPL; this is most clear on annual data, but also occurs to a lesser extent on quarterly and monthly data. This discrepancy happens because multiplicative error models adjust their variance depending on the current one-step-ahead prediction, µ_t, and it appears that the series volatility is higher when µ_t is high.

105 4. DATA TRANSFORMS WITH EXPONENTIAL SMOOTHING METHODS OF FORECASTING This chapter is based on a paper by the author published in the International Journal of Forecasting, namely Beaumont (2014). 4.1 Introduction The exponential smoothing methods of forecasting from Chapter 2 were applied in the M3-competition (Makridakis and Hibon 2000) to 645 annual series, 756 quarterly series and 1428 monthly series. The exponential smoothing methods were remarkably successful, outperforming many other more sophisticated methods. One limitation of the study methods, however, is that they ignored the use of transforms to potentially improve forecasts.

106 4. Data Transforms with Exponential Smoothing Methods of Forecasting 106 Transforms should have been considered for two reasons. First, they are an indirect way to introduce multiplicative models, so, for instance, the log transform effectively turns multiplicative components into additive components. Second, series which have outliers may be better modelled by a distribution with fatter tails than the normal distribution; this applied to two of the error transforms. This chapter extends the exponential smoothing component of the M3-competition to consider the effect of the following transforms. 1. log transform 2. Box-Cox transform 3. Johnson error trend seasonal transform 4. heteroscedastic state space transform 5. t transform The first two transforms were applied to the series observations, and the remaining three were applied to the errors. While there is much existing literature on the series transforms (Nelson and Granger 1979; Thyer, Kuczera

107 4. Data Transforms with Exponential Smoothing Methods of Forecasting 107 and Wang 2002), the first two transforms on the errors presented in this chapter are original. The models to which exponential smoothing methods are applied here are: 1. Local level (Brown 1959) 2. Local trend (Holt 1957 republished 2004; Winters 1960) 3. Damped trend (Gardner and McKenzie 1985) 4. Seasonal with local level 5. Seasonal with local trend 6. Seasonal with damped trend These additive models were estimated using the maximum likelihood framework associated with the innovations state space model (see Chapter 2). The M3 methodology of evaluating predictions on withheld data was adopted and extended to move away from its emphasis on point forecasts, to an emphasis on prediction distributions as a whole. Predictions on withheld data were compared for the various transforms.

The restrictions on the smoothing and damping parameters α, β, γ and φ in Table 2.1 are 0 ≤ α ≤ 1, 0 ≤ β ≤ α, 0 ≤ γ ≤ 1 − α and 0 < φ ≤ 1. The Akaike Information Criterion (AIC) (Akaike 1974) was used to select the best model from the six models in Table 2.1 for all transforms discussed here. From Chapter 2, the sample log likelihood for any transform can be estimated given the estimated standard deviation of the transformed errors σ̂, and is given by:

    log L = −(n/2)(1 + log(2π) + 2 log(σ̂)) + J,    (4.1.1)

where n is the sample size and J is the logged Jacobian. This is an adjustment that depends on the transform being considered; J = 0 when there is no transform. Excluding J, (4.1.1) is just the normal log likelihood. In this chapter, other than one parameter in one transform, all transforms optimise all parameters using maximum likelihood estimation, including λ in the Box-Cox transform. Models with a multiplicative component can also be used. However, as shown in Chapter 3, using the AIC to select from these 30 ETS models did not

produce results that were statistically better than if the AIC were restricted to selecting from the six additive models; for some performance measures and frequencies, the AIC-add was significantly better than AIC-all. The remainder of this chapter will explore the various transforms: this is separated into transforms on the series, and transforms on the errors. Then the ranked probability score prediction performance measure will be introduced. The experimental design will then be explained. Finally, results will be presented, and conclusions drawn.

4.2 Transforms on the series

Series transforms involve a function of the form y* = f(y), where f is a transform function and y the series observations. Once the y*_t are known, Equations (2.5.1) to (2.5.3) can be applied to the y*_t. The errors are assumed to be normally distributed in the transformed space. For forecasting purposes, the inverse transform function y = f⁻¹(y*) must also be known. To calculate the log Jacobian in Equation (4.1.1), f(y) is differentiated, and the results logged. Summing these results gives the logged Jacobian. Two series

transforms were used: the log transform and the Box-Cox transform.

The Log Transform

In this transform, y* = log(y) and the inverse transform is y = exp(y*). The Jacobian adjustment in Equation (4.1.1) is J = −Σ log(y_t). This is a simple transformation, but fairly restrictive. It cannot be applied to series with zero or negative values, but all the M3 series have only positive values.

The Box-Cox Transform

As originally defined (Box and Cox 1964), the Box-Cox transform has:

    y* = (y^λ − 1)/λ,   0 ≤ λ ≤ 1.    (4.2.1)

The inverse transform is then:

    y = (λy* + 1)^{1/λ}.    (4.2.2)

However, this presents problems if (λy* + 1) < 0 and λ < 1; this can happen when predicting after applying the Box-Cox transform. In this case, y can become complex, which does not make any sense. Therefore, for the Box-Cox inverse transform to make sense, it is defined as:

    y = sgn(λy* + 1) |λy* + 1|^{1/λ}.    (4.2.3)

This is defined for all y* and is a one-to-one function, with no discontinuities. Equation (4.2.1) can also be defined for negative y; this transform now becomes:

    y* = sgn(y) (|y|^λ − 1)/λ.    (4.2.4)

When λ = 0, use of l'Hôpital's rule on (4.2.1) gives y* = log(y). When λ = 1, (4.2.1) becomes y* = y − 1; this acts as if there were no transform. The logged Jacobian of the Box-Cox transform in (4.1.1) is (λ − 1) Σ log|y_t|. While the log transform is a limiting case of the Box-Cox transform, the latter does not always default to the former, and it is useful to consider them separately, especially as some analysts may routinely use the log transform.
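A compact sketch of the sign-extended transform (4.2.4) and its inverse (4.2.3) is given below, with λ = 0 falling back to the log transform; the function names are illustrative only.

```python
import numpy as np

def boxcox(y, lam):
    """Sign-extended Box-Cox transform (4.2.4); lam = 0 gives the log transform."""
    if lam == 0:
        return np.log(y)
    return np.sign(y) * (np.abs(y) ** lam - 1.0) / lam

def boxcox_inverse(z, lam):
    """Sign-extended inverse transform (4.2.3), defined for all z."""
    if lam == 0:
        return np.exp(z)
    u = lam * z + 1.0
    return np.sign(u) * np.abs(u) ** (1.0 / lam)

y = np.array([2.0, 5.0, 9.0])
print(boxcox_inverse(boxcox(y, 0.3), 0.3))   # recovers y
```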

4.3 Transforms on the errors

Transforms can also be applied to the errors; in these cases, the transformed errors are assumed to be normally distributed. In this section, three new transforms are considered. When point predictions are made, the errors are assumed to be zero, and thus the error transforms do not apply for point prediction purposes. However, because all parameters are re-estimated, smoothing and damping parameter estimates may change from the non-transformed results, causing point predictions to also change.

JETS Transform

The Johnson Error Trend Seasonal (JETS) transform comes from combining the sinh transforms in Johnson (1949) with the ETS models in Hyndman et al. (2008a), and is the author's invention. Suppose we would like a distribution that can accommodate outliers. Let ε represent the error after adjustment for outliers, and e represent the error before adjustment. If we have ε ~ N(0, σ²), we would like the unadjusted error e to be such that e = ε when ε is near 0, and for e to become larger than ε as we move further

away from 0. In other words, if ε is near 0, de/dε should be near 1, and further away from 0, de/dε should become steeper. A function that satisfies these requirements is the sinh function with parameter κ, κ ≥ 0, and e can be expressed as:

    e = (1/κ) sinh(κε) = (1/(2κ))[exp(κε) − exp(−κε)];
    de/dε = cosh(κε) = (1/2)[exp(κε) + exp(−κε)];    (4.3.1)

so that de/dε = 1 at ε = 0, as required. L'Hôpital's rule can be used to show that e → ε as κ → 0, as also required. For a time series of length n the quantity e_t is actually observed, and we then need to apply the inverse sinh transformation to get ε_t, which can then be used to smooth the next values of x_t. In general state space terms, the formulation for the JETS transform is:

    e_t = y_t − w'x_{t−1};
    ε_t = (1/κ) sinh⁻¹(κe_t);
    x_t = Fx_{t−1} + gε_t;    (4.3.2)
    ε_t ~ N(0, σ²).
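A minimal sketch of the error mapping in (4.3.1) and (4.3.2) is shown below: observed errors are pulled back to the Gaussian scale with the inverse sinh, and pushed forward with sinh when simulating; the κ value used here is a placeholder.

```python
import numpy as np

def jets_to_gaussian(e, kappa):
    """eps_t = (1/kappa) * asinh(kappa * e_t); kappa = 0 means no transform."""
    return e if kappa == 0 else np.arcsinh(kappa * e) / kappa

def jets_from_gaussian(eps, kappa):
    """e_t = (1/kappa) * sinh(kappa * eps_t), the inverse mapping used in simulation."""
    return eps if kappa == 0 else np.sinh(kappa * eps) / kappa

e = np.array([-3.0, -0.5, 0.0, 0.5, 3.0])
print(jets_from_gaussian(jets_to_gaussian(e, 1.0), 1.0))   # recovers e
```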

The derivative of z = sinh⁻¹(x) is dz/dx = 1/√(1 + x²). Since ε_t is Gaussian, we can now find the pdf of e; this is

    f(e) = (1/(σ√(2π))) exp(−ε²/(2σ²)) × 1/√(1 + (κe)²).    (4.3.3)

Substituting ε = (1/κ) sinh⁻¹(κe) and using φ(·) to represent the standard normal pdf, we obtain the pdf of the Johnson distribution as:

    f(e) = (1/(σ√(1 + (κe)²))) φ(sinh⁻¹(κe)/(κσ)).    (4.3.4)

The non-normal component of (4.3.3) implies that the logged Jacobian in (4.1.1) of the JETS transform is:

    J = −(1/2) Σ log(1 + (κe_t)²).    (4.3.5)

Figure 4.1 gives curves for the pdfs of both the standard normal distribution and the Johnson distribution with κ = 1. Using Equation 37 in Johnson (1949), we can calculate the mean and variance

Fig. 4.1: Comparison of Johnson and Normal Distributions

of the JETS distribution as E(e) = 0 and

    Var(e) = (1/(2κ²))(exp(2κ²σ²) − 1).    (4.3.6)

Occasionally, the likelihood function can produce a κ that makes sense when the probability of getting an observation is considered, but where the variance of e can be extremely large, and therefore meaningless for prediction distribution purposes. To avoid this, κ is restricted such that the variance of e is less than 10 V_std, where V_std is the variance of the equivalent no-transform model. The JETS transform can also be applied as a series transform. This may work well when the data have positives and negatives, but does not work well on the M3 data, which are all positive values. The JETS transform on the observations will therefore not be further considered in this chapter.

HSS Transform

The Heteroscedastic State Space (HSS) transform (Hyndman et al., 2008a, Chapter 4, extended with personal communication from J. K. Ord in 2011)

can be used for both additive and multiplicative error models. Additive error models use e_t = y_t − µ_t, while multiplicative error models use ε_t = (y_t − µ_t)/µ_t. The HSS transform includes a parameter λ that can correspond to either an additive error model, a multiplicative error model, or any model between these special cases. The HSS transform takes the basic form:

    µ_t = w'x_{t−1};
    e_t = y_t − µ_t;
    x_t = Fx_{t−1} + ge_t;    (4.3.7)
    ε_t = e_t / |µ_t|^λ;
    ε_t ~ N(0, σ²).

Here, λ is the HSS parameter, with 0 ≤ λ ≤ 1. λ = 0 corresponds to no transform (additive error model), and λ = 1 corresponds to a multiplicative error model. We assume that ε_t is normally distributed, and e_t is a function of ε_t, and is also normally distributed. Problems can occur in prediction measures when the prediction µ_t becomes close to zero, as we saw in Chapter 3.

To avoid this problem, (4.3.7) was changed to have

    ε_t = e_t / (c + µ_t)^λ,    (4.3.8)

where c > 0 is a constant. This model has the variance of e_t dependent on the expected mean µ_t for λ > 0; hence, it is heteroscedastic. The formula for the variance is:

    Var(e_t) = (c + µ_t)^{2λ} σ².    (4.3.9)

This means that e_t is normal with mean 0 and variance defined in (4.3.9). The logged Jacobian in (4.1.1) is

    J = −λ Σ log(c + µ_t).    (4.3.10)

To calculate c, either a proportional c approach or an estimation approach was used.
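A hedged sketch of the log likelihood implied by (4.3.8) to (4.3.10) is given below; c, λ and the error and prediction vectors are treated as inputs rather than values estimated here, and the function name is illustrative.

```python
import numpy as np

def hss_loglik(e, mu, lam, c):
    """Gaussian log likelihood of eps_t = e_t / (c + mu_t)^lam plus the Jacobian term."""
    eps = e / (c + mu) ** lam
    n = len(e)
    sigma_hat = np.sqrt(np.mean(np.square(eps)))
    return (-0.5 * n * (1.0 + np.log(2 * np.pi) + 2.0 * np.log(sigma_hat))
            - lam * np.sum(np.log(c + mu)))
```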

HSS proportional c approach

c is not estimated using maximum likelihood approaches; instead it is given as:

    c = 0.1 × (mean of the last N years of the fitting sample).    (4.3.11)

N was six years for annual data, four years for quarterly data (16 observations) and two years for monthly data (24 observations). The choice of 0.1 in (4.3.11) was made after some experimentation with alternative values; 0.2 was found to be too large, as it gave worse results than using 0.1. We do not want the constant to be too small, to prevent a near-zero (c + µ_t), so 0.1 was chosen. We used the mean of the last few fitting sample observations because this is more likely to correspond well to future series values.

HSS estimation approach

Equation (4.3.9) can have λ → 0 and c → ∞, yet we can end up with the same variance we would get from less extreme values of λ and c. Because of these numerical problems, the variance of e_t needed to be re-parameterized.

Here, the following was used:

    Var(e_t) = σ̂₁² (1 − A + Aµ_t)^{2λ}.    (4.3.12)

This had the restrictions 0 ≤ A, λ ≤ 1. The conversion back to the (c, σ̂²) formulation is c = (1 − A)/A and σ̂² = σ̂₁² A^{2λ}.

Using the t distribution

In the state space approach we have e_t ~ N(0, σ²). An alternative is to replace the normal distribution with a t_ν distribution (Harvey and Luati 2014), where ν is the degrees of freedom, and to assume that e_t ~ σ t_ν. Tables of the t distribution list only discrete ν, and if ν is discrete, then optimisation becomes difficult, as optimisation programs do not deal well with discrete inputs. However, there is actually no reason why we cannot have a continuous ν; the probability density function of a t_ν distribution for standardised observation z is (Watson, 1983, pg 425):

    f(z) = [Γ((ν + 1)/2) / (√(νπ) Γ(ν/2))] (1 + z²/ν)^{−(ν+1)/2}.    (4.3.13)

This expression clearly does not need ν to be discrete. As a result, instead of assuming that e_t ~ N(0, σ²), we would assume that e_t ~ σ t_ν, with σ̂ = √(Σ ê_t²/n), as for the normal distribution. To calculate the log likelihood, we first need to divide the errors by σ̂ to get the standardised errors ẑ_t. Because we are dividing by σ̂, this also becomes part of the likelihood. ν is also estimated as part of this process. From Equation (4.3.13), and using a process similar to that described in Section 2.6, we can calculate the log likelihood as:

    log L = n log[ Γ((ν + 1)/2) / (√(νπ) Γ(ν/2)) ] − n log(σ̂) − ((ν + 1)/2) Σ_{t=1}^{n} log(1 + ẑ_t²/ν).    (4.3.14)

The equations for the t distribution will be the same as in (2.7.1) and (2.7.2), but using e_t ~ σ t_ν instead of e_t ~ N(0, σ²).

4.4 Continuous Ranked Probability Score (CRPS)

The ranked probability score is a distribution-based forecasting measure that measures the distance between distributions. This is different from the MLPL, which assesses the probability density of a distribution at a point.

The RPS was originally defined for only discrete distributions (Epstein 1969; Murphy 1971), and was used for count time series in an exponential smoothing context in Snyder, Ord and Beaumont (2012). The continuous ranked probability score is defined as (Gneiting and Raftery 2007):

    CRPS(y_obs; F) = ∫ (δ(y > y_obs) − F(y))² dy
                   = ∫_{−∞}^{y_obs} F(y)² dy + ∫_{y_obs}^{∞} (1 − F(y))² dy.    (4.4.1)

Here, F(y) is the cumulative distribution function (cdf) of y, and δ(y > y_obs) is the delta function, which is 1 when y > y_obs and 0 otherwise. The CRPS is basically measuring the distance between one distribution and another. It can be calculated in three ways; the preferred method depends on what transform is being used.

1. When there is no transform, and only then, the following formula from Gneiting and Raftery (2007) can be used:

    CRPS = σ̂[2φ((y − µ̂)/σ̂) − 1/√π] + (y − µ̂)[2Φ((y − µ̂)/σ̂) − 1].    (4.4.2)

µ̂ and σ̂ are normal distribution parameter estimates, φ(·) is the standard normal probability density function, and Φ(·) is the standard nor-

mal cdf. µ̂ is the point prediction, and σ̂ is calculated via the variance algorithm described in (2.7.4).

2. When a series transform is applied, the necessity to transform back into the original space means that (4.4.2) cannot be used. Instead, a quantile method is used (developed in collaboration with R. D. Snyder in 2011). This involves the following steps:

(a) Calculate the point prediction mean and variance, ŷ*_{n+j} and Var(y*_{n+j}), in the transformed space.
(b) Generate R evenly spaced probabilities going from 1/(2R) to (2R − 1)/(2R) (these probabilities correspond to the quantiles).
(c) Use the normal inverse function to find R transformed-space values based on the mean and variance in step (a).
(d) Convert these values into the real space by applying the inverse transformation. E.g., if the log transform was used, exponentiate here.
(e) Sort the R quantile values, and calculate d_i = y_{i+1} − y_i, the distances between the sorted values. As this will have one less term than the number of quantiles, put d_R = 0. k is the index of the quantile value immedi-

(f) The CRPS can now be approximated by the following formula, derived from (4.4.1) (an illustrative code sketch of this approximation, together with (4.4.2), is given after the simulation methods below):

\mathrm{CRPS} \approx \sum_{i=1}^{k-1} \left(\frac{i}{R}\right)^2 d_i + \left(\frac{k}{R}\right)^2 (y_{obs} - y_k) + \left(1 - \frac{k}{R}\right)^2 (y_{k+1} - y_{obs}) + \sum_{i=k+1}^{R} \left(1 - \frac{i}{R}\right)^2 d_i. \qquad (4.4.3)

If k = 0, only the last two terms appear; this is the case where the observed value is less than all R projections. If k = R, only the first two terms appear.

3. When the error transforms are used, the quantile method cannot be used because the variance of y_{n+j} is unknown and/or the distribution is non-normal. Instead, a simulation method is used. The HSS transform involves the following simulation method, for all withheld periods:

y_t \sim N\!\left(\mu_t, (c + \mu_t)^{2\lambda} \sigma^2\right);
e_t = y_t - \mu_t;
x_t = F x_{t-1} + g e_t; \qquad (4.4.4)
\mu_{t+1} = w' x_t.

This is repeated R times. For the JETS method, the simulation involves:

\varepsilon_t \sim N(0, \sigma^2);
e_t = \frac{1}{\kappa}\sinh(\kappa\varepsilon_t);
\mu_t = w' x_{t-1}; \qquad (4.4.5)
y_t = \mu_t + e_t;
x_t = F x_{t-1} + g \varepsilon_t.

This is also repeated R times. We used R = 1000 for all RPS simulations and the quantile method. The t simulation method replaces the first two sub-equations of (4.4.5) with e_t ~ σt_ν. In all cases, the y_t's are simulated values. The procedure in parts (2e) and (2f) can now be used to find an approximate simulated CRPS. The disadvantage of using simulations is that the results can vary from one run to another, whereas for the quantile method they do not vary.
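To make the preceding recipes concrete, the following Python sketch (our own illustration; the function names and the inverse-transform callable are assumptions, not the thesis's code) implements the closed-form normal CRPS of (4.4.2) and the quantile approximation of (4.4.3):

import numpy as np
from scipy.stats import norm

def crps_normal(y_obs, mu, sigma):
    """Closed-form CRPS (4.4.2) for a normal predictive distribution."""
    z = (y_obs - mu) / sigma
    return sigma * (2 * norm.pdf(z) - 1 / np.sqrt(np.pi)) + (y_obs - mu) * (2 * norm.cdf(z) - 1)

def crps_quantile(y_obs, mu, var, inverse_transform, R=1000):
    """Quantile approximation (4.4.3): mu and var are in the transformed space,
    and inverse_transform maps transformed values back to the original space."""
    probs = (2 * np.arange(1, R + 1) - 1) / (2 * R)       # 1/(2R), ..., (2R-1)/(2R)
    q = np.sort(inverse_transform(norm.ppf(probs, loc=mu, scale=np.sqrt(var))))
    d = np.append(np.diff(q), 0.0)                        # d_i = y_{i+1} - y_i, with d_R = 0
    k = np.searchsorted(q, y_obs)                         # number of quantiles below y_obs
    i = np.arange(1, R + 1)
    crps = 0.0
    if k > 1:
        crps += np.sum((i[:k - 1] / R) ** 2 * d[:k - 1])
    if k > 0:
        crps += (k / R) ** 2 * (y_obs - q[k - 1])
    if k < R:
        crps += (1 - k / R) ** 2 * (q[k] - y_obs)
        crps += np.sum((1 - i[k:] / R) ** 2 * d[k:])
    return crps

For the no-transform case the two functions should agree closely for large R; for series transforms only the quantile version applies, with inverse_transform set to the relevant back-transformation (for example np.exp for the log transform).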

4.4.1 Why the RPS and MLPL can give differing results

For transforms, the Minus Log Prediction Likelihood (MLPL) can be defined for horizon 1 as:

\mathrm{MLPL} = 0.5\left(\log(2\pi) + 2\log(\hat{\sigma}) + e^2/\hat{\sigma}^2\right) - J. \qquad (4.4.6)

Here, σ̂ is the sample standard deviation and e is the point prediction error of the one-step-ahead forecast. J is a Jacobian adjustment that depends on the transformation being used; it is calculated as given in the transform sections, but without the summations. There are some series where one model will do better than another model when the RPS is considered, but worse when the MLPL is considered. This is because the RPS measures the distance between two distributions, while the MLPL measures the negative of the log of the density of an actual observation y_{t+1} given information up to time t. If we refer back to Figure 4.1, we can see that the JETS distribution with κ = 1 has a greater variance than the standard normal distribution. Suppose we observe y = 0. Both distributions have the same heights at this point, so the MLPL for both will be the same.

However, the RPS is given by

\mathrm{RPS} = \int_{-\infty}^{y_{obs}} F(y)^2\, dy + \int_{y_{obs}}^{\infty} (1 - F(y))^2\, dy.

Since a large part of the JETS distribution is outside the normal distribution, the RPS for the JETS distribution will be worse than for the normal distribution. On the other hand, if we observe y = 4, the JETS distribution clearly has a far better log likelihood at this point than the normal distribution, so the MLPL at this point will be better for the JETS distribution. However, the RPS measure will not necessarily agree, because a great deal of the JETS distribution is far away from 4. In general, if we want to increase the density of actual future observations, we should use the MLPL; if instead we want a reduced variance, the RPS should be used. We could have introduced the RPS in Chapter 3. However, the Chapter 3 results tables are already wide enough with just three performance measures, and the RPS results tabulated in this chapter and later in the thesis are very similar to the point prediction results in terms of the ordering of transforms.

4.5 Experimental Design

The M3 data set (Makridakis & Hibon 2000) has 645 annual series, 756 quarterly series and 1428 monthly series. Six observations are withheld for annual series, eight for quarterly and eighteen for monthly. The following procedure was then used, with seven transforms and N the number of series at a particular frequency.

1. For each series, apply each transform to the six additive models described in Chapter 2 (three for annual data), and use maximum likelihood estimation to optimise all parameter values, including seed states.

2. Select the best model for each transform using the AIC; this model was used for predictions.

3. Use the rolling origin method to get results for this model for each future horizon period for the ASE, APE and RPS; for the MLPL, only the first horizon was used.

4. Average the results for each lead time to get one prediction measure result per lead time for each series, as illustrated in Table 3.2.

5. The overall prediction measure for a series will then be an average of the prediction measures obtained in step 4. This means that we now have an N × 7 data table for each prediction measure.

6. Means and medians for all transforms can now be found.

4.6 Results

Table 4.1 shows the mean results for each seasonal frequency. Although medians were calculated, they were very similar across the transforms, and means gave a better spread and allowed more statistical techniques to be used for analysis. The transform codes are:

ID - identity, no transform
Log - log transform
BC - Box-Cox
HSS p - HSS proportional c approach
HSS e - HSS estimation approach

JETS - Johnson error transform
t - using the t distribution

Tab. 4.1: Original Transform Means, Best Transforms in Bold (rows: MASE, MAPE, RPS, MLPL; columns: ID, Log, BC, HSS p, HSS e, JETS, t; given separately for the Annual (645 series), Quarterly (756 series) and Monthly (1428 series) data)

In some series the observed data will be relatively small, and in other series large. In many series the forecasts will be fairly close to the actual data, but in a few series the forecasts are very poor. For these reasons, a log transform was applied to the data obtained in Section 4.5, producing an N × 7 table for each forecasting measure, where the measure itself was logged.

A two-way ANOVA was then applied to this table, with series as the row factor and transform as the column factor. In Table 4.2, both series and transform were always significant, and the question was which particular transforms were significantly better than others. Conclusions for each data frequency are summarised below Table 4.5. Assumptions required for ANOVA to be used, such as similar variances of the logged transforms, were satisfied.

Tab. 4.2: Means of Logged Transform Data, Best Transforms in Bold (rows: MASE, MAPE, RPS, MLPL; columns: ID, Log, BC, HSS p, HSS e, JETS, t; given separately for the Log Annual (645 series), Log Quarterly (756 series) and Log Monthly (1428 series) data)
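As an indication of how this analysis can be reproduced, the sketch below (illustrative only; the data frame and column names are assumptions) fits the two-way ANOVA with series and transform as factors and then runs Tukey pairwise comparisons on the transform factor. Note that pairwise_tukeyhsd ignores the series blocking, so it only approximates the comparisons reported in Tables 4.3 to 4.5.

import pandas as pd
import statsmodels.formula.api as smf
from statsmodels.stats.anova import anova_lm
from statsmodels.stats.multicomp import pairwise_tukeyhsd

# One row per (series, transform) pair with the logged measure; the file and
# column names here are assumptions for illustration.
df = pd.read_csv("log_mase_monthly.csv")        # columns: series, transform, log_mase

model = smf.ols("log_mase ~ C(series) + C(transform)", data=df).fit()
print(anova_lm(model, typ=2))                   # significance of the series and transform factors

tukey = pairwise_tukeyhsd(df["log_mase"], df["transform"])
print(tukey.summary())                          # pairwise comparisons between transforms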

4.6.1 Tukey's Comparisons

Tables 4.3, 4.4 and 4.5 give grouping information, based on pairwise testing between transforms, adjusted for multiple comparisons using Tukey's method. Statistical inference on these tables uses the same methods as in Chapter 3.

Tab. 4.3: Transform Groupings for Annual Data

log MASE         log MAPE         Log RPS          Log MLPL
trans   Mean     trans   Mean     trans   Mean     trans   Mean
Log     a        Log     2.526a   Log     6.302a   ID      2.224a
HSS p   b        HSS p   2.440b   t       6.217b   BC      2.196ab
HSS e   b        HSS e   2.435b   ID      6.210b   t       2.183bc
t       b        t       2.431b   JETS    6.200b   JETS    2.177bcd
ID      b        ID      2.427b   HSS p   6.194b   Log     2.176bcd
JETS    b        JETS    2.424b   BC      6.187b   HSS e   2.155cd
BC      b        BC      2.411b   HSS e   6.187b   HSS p   2.143d

Tab. 4.4: Transform Groupings for Quarterly Data

log MASE         log MAPE         Log RPS          Log MLPL
trans   Mean     trans   Mean     trans   Mean     trans   Mean
Log     a        Log     1.759a   Log     5.476a   BC      1.984a
BC      ab       BC      1.727b   BC      5.446b   ID      1.981ab
HSS e   b        HSS e   1.719b   HSS p   5.438b   HSS e   1.969abc
HSS p   b        HSS p   1.719b   HSS e   5.436b   JETS    1.968abc
ID      b        JETS    1.704b   t       5.427b   Log     1.966bc
JETS    b        ID      1.703b   ID      5.426b   t       1.965bc
t       b        t       1.699b   JETS    5.424b   HSS p   1.961c

Tab. 4.5: Transform Groupings for Monthly Data

log MASE         log MAPE         Log RPS          Log MLPL
trans   Mean     trans   Mean     trans   Mean     trans   Mean
ID      a        ID      2.194a   t       5.715a   ID      1.973a
t       ab       t       2.191a   ID      5.708ab  HSS e   1.970ab
BC      ab       BC      2.178ab  JETS    5.692bc  BC      1.965bc
JETS    ab       JETS    2.177ab  BC      5.687c   HSS p   1.963bc
HSS p   ab       HSS p   2.176ab  HSS p   5.684c   t       1.962bc
HSS e   ab       HSS e   2.176ab  Log     5.681c   JETS    1.962c
Log     b        Log     2.160b   HSS e   5.680c   Log     1.960c

Annual

For the RPS, MASE and MAPE measures, the log transform is significantly worse than all other transforms, but there is no significant difference between these other transforms. For the MLPL measure, the identity transform is significantly worse than most other transforms, and the HSS proportional c approach looks like the best method. Since HSS proportional c is not significantly worse than other transforms on the other three measures, it should be selected.

Quarterly

The quarterly data looks very similar to the annual data in its performance. For RPS, MASE and MAPE, the log transform is significantly worse than all other transforms, which are grouped together.

For the MLPL, the HSS proportional c approach appears to do best, and so it should be used.

Monthly

The simple log transform does very well on all monthly performance measures. This performance was a big surprise given that it does poorly on most measures for quarterly and annual data; this could be because many monthly series have multiplicative components, and the log transform makes these components additive. For monthly data, the log transform is always significantly better than the identity transform, and is always in the same grouping as the best of the other transforms. In this respect, the monthly log transform result is relatively better when the data is log transformed than when it is left in its original form.

In general, for the MLPL measure in annual and quarterly data, and for all performance measures in monthly data, at least one transform is significantly better than the non-transformed models. Since the HSS proportional c approach is never significantly worse than any other transform on any performance measure or frequency, we recommend it. We surmise that this is because the HSS transform can act as an additive error model, a multiplicative error model, or as a bit of both, and that the c parameter is better chosen using our procedure than by estimation.

4.6.2 Encompassing Model vs AIC Method

In the above analysis, we have used the AIC for model selection. However, it is also regarded as legitimate to just use the encompassing model; this will be the non-seasonal damped trend model for annual data, and the seasonal damped trend model for quarterly and monthly data. Table 4.2 was thus repeated using the encompassing model, and the differences between this table and Table 4.2 were then found and presented in Table 4.6. The results show that on most occasions using the best model identified by the AIC was at least as good as using the encompassing model, and so the use of the AIC method was retained.

4.7 Conclusions

In this chapter, we have introduced various transforms that can be applied to both the series and the errors. We have used distribution and point forecasting measures to compare transforms. We have applied the transforms to the M3 data, and have then used ANOVA to test for significant differences.

Tab. 4.6: AIC - ENC Log Means (Negatives imply that AIC is better) (rows: MASE, MAPE, RPS, MLPL; columns: ID, Log, BC, HSS p, HSS e, JETS, t; given separately for the Log Annual (645 series), Log Quarterly (756 series) and Log Monthly (1428 series) data)

We have found that the log transform is not significantly worse than any other transform for the monthly data. On the annual and quarterly data, the log transform is significantly worse than any other transform on the ASE, APE and RPS measures, but the MLPL measure has the identity transform doing significantly worse than some of the other transforms. In all measures for monthly data, at least one transform does significantly better than the identity transform. Since the HSS proportional c approach is never significantly worse than any other transform, we recommend it for use.

5. A SIMULATION STUDY OF THE TRANSFORMS

5.1 Introduction

In the preceding chapter we applied seven transforms to the M3 data, and evaluated their performance. To gain further insights into why the various transforms perform as they do, we used a simulation study. For this study, we used the same simulation procedures and settings as those described in Koehler, Snyder, Ord and Beaumont (2012). A series of length 72 that used a local level model was generated; this series has l_0 = 100, α = 0.2 and σ = 5. Twelve observations were withheld for prediction validation, meaning that 60 were used for fitting. At observation 58, the series jumped by 5σ = 25 to a new level of 125. What happened next depended on the type of simulation being conducted. In the case of an additive outlier (AO),

the series would immediately revert back to its previous level. For a level shift (LS), the series would continue at a new level, higher than its old level. Finally, for a transitory change (TC), the series would revert back to its old level over a period of time. Note that this means that a marked perturbation occurs in the series, very close to the point at which predictions are required: just three time points before the end of the data used for fitting. This is deliberate; the choice is therefore a strong challenge to the capacity of any method, in that there is very little post-perturbation data for the method to adjust its predictions. The Koehler et al. (2012) paper experimented with having the perturbation at periods 30, 50, 55 and 58. As exponential smoothing is able to adjust to perturbations quickly, the effect of the perturbation on forecasts was greatest when the perturbation occurred at period 58, and so period 58 was selected for the experiments in this chapter. More formally, the simulated series used the local level model formulation,

but included an extra term. Thus we have:

y_t = l_{t-1} + p A_t + e_t;
l_t = l_{t-1} + \alpha e_t; \qquad (5.1.1)
e_t \sim N(0, \sigma^2).

p is the magnitude of the spike at period 58, and here we set it to 5σ = 25. A_t = 0 until the jump at period 58. For the AO case, A_t = 1 only at period 58, and returns to zero immediately afterwards. For the LS case, A_t = 1 from period 58 onwards. For the TC case, from period 58, A_t = δ^{t-58}, where t is the current period. The AO case can be viewed as corresponding to the TC case with δ = 0, since \lim_{x \to 0^+} x^x = 1. δ = 1 is equivalent to the LS case, so the range of δ for the TC case is 0 < δ < 1. In our simulation, we set δ = 0.5. We created 2000 series for each type of simulation. Figure 5.1 shows the means of the three simulated models, from observations 45 to 72. Note that these models do not reflect the variation in individual series. An individual AO series is shown in Figure 5.2. The spike at period 58 is clear, but the series is volatile elsewhere.

Fig. 5.1: Models of Types of Simulations

Fig. 5.2: Individual AO series
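A minimal sketch of the data-generating process in (5.1.1), with the fixed settings used in this chapter, is given below (our own illustration; the function and variable names are assumptions):

import numpy as np

def simulate_series(kind, n=72, jump=58, level0=100.0, alpha=0.2,
                    sigma=5.0, p=25.0, delta=0.5, rng=None):
    """Local level model with an AO, LS or TC perturbation at the jump point, as in (5.1.1)."""
    rng = np.random.default_rng() if rng is None else rng
    y = np.empty(n)
    level = level0
    for t in range(1, n + 1):
        if t < jump:
            a_t = 0.0
        elif kind == "AO":
            a_t = 1.0 if t == jump else 0.0
        elif kind == "LS":
            a_t = 1.0
        else:                                  # "TC": geometric decay back to the old level
            a_t = delta ** (t - jump)
        e_t = rng.normal(0.0, sigma)
        y[t - 1] = level + p * a_t + e_t       # y_t = l_{t-1} + p*A_t + e_t
        level = level + alpha * e_t            # l_t = l_{t-1} + alpha*e_t
    return y

series = {kind: np.array([simulate_series(kind) for _ in range(2000)])
          for kind in ("AO", "LS", "TC")}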

For each series, we used the fitting sample of 60 observations to compute the smoothing and initial state estimates of the local level, local trend and damped trend models for all seven transforms. We then selected the best of these three models using the AIC. Using the model selected by the AIC, 12 forecasts were made for each transform using the rolling origin method, and these forecasts were compared with the actual future simulated values using the MASE, MAPE, RPS and MLPL prediction performance measures. The same methods used in the previous chapter were used to create grouping tables for the log of each performance measure. The local level model was used to generate all simulated series. For each simulation type and transform, we recorded the percentage of times that the (correct) local level model (out of the local level, local trend and damped trend models) was actually selected, and the mean value of α̂ over all 2000 local level models.

5.2 LS Results

Table 5.1 gives the grouping information for transforms using 2000 LS simulated series after using ANOVA, and Table 5.2 gives the % correct and mean α̂ statistics.

Tab. 5.1: Log LS Results by Grouping

Log MASE           Log MAPE           Log RPS            Log MLPL
Trans   Mean       Trans   Mean       Trans   Mean       Trans   Mean
JETS    0.434a     JETS    1.896a     JETS    1.870a     JETS    1.381a
Log     0.050b     Log     1.514b     Log     1.516b     Log     1.216b
HSS e   0.015c     HSS e   1.479c     HSS e   1.486c     HSS e   1.202c
HSS p   0.015c     HSS p   1.479c     HSS p   1.482c     HSS p   1.200c
ID      0.005cd    ID      1.460cd    BC      1.433d     BC      1.186d
BC      0.005cd    BC      1.459cd    t       1.430d     ID      1.186d
t       0.014d     t       1.450d     ID      1.430d     t       1.184d

Tab. 5.2: % of times correct local level model selected, and mean α̂ for LS case

Transform   ID    Log   BC    HSS p   HSS e   JETS   t
% Correct   93%   91%   90%   81%     90%     90%    92%
Mean α̂

Table 5.2 shows that the correct local level model is selected by the AIC about 90% of the time, although it is lower for HSS proportional c. The mean estimated α̂ of about 0.48 is an overestimate of the actual α of 0.2; this happens because a high α̂ means that the fitted series will catch up to the actual level faster, thus giving better forecasts.

JETS performs worst because it assumes the wrong kind of outlier model. Because the error at the jump point is large, it is diminished by using ε_t = (1/κ) sinh^{-1}(κe_t). Since ε_t instead of e_t is being used in the transition Equation (2.5.3), the level of the series after the level shift will be underestimated relative to other transforms, and this underestimation will continue as more points from the new level are added. The Box-Cox, ID and t transforms are best because in this case it is better not to use transforms on either observations or errors, as the fitted values will approach the new level faster when there are no transforms. The mean estimated λ for the Box-Cox transform was 0.90, which means that the Box-Cox is similar to the non-transformed series.

5.3 AO Results

Table 5.3 gives the grouping information for transforms using 2000 AO simulated series, and Table 5.4 gives the % correct and mean α̂ statistics. In contrast to the LS case's 90% correct model selection, the AO case only results in the selection of the local level model about 56% of the time by the AIC.

It is clear that the one-period series spike often leads to a local trend or damped trend model being selected.

Tab. 5.3: Log AO Results by Grouping

Log MASE           Log MAPE           Log RPS            Log MLPL
Trans   Mean       Trans   Mean       Trans   Mean       Trans   Mean
JETS    1.184a     JETS    2.644a     JETS    2.685a     ID      2.084a
t       1.140b     t       2.600b     t       2.613b     BC      2.067ab
Log     1.116c     Log     2.576c     Log     2.585c     HSS e   2.047bc
ID      1.100d     ID      2.561d     ID      2.578cd    HSS p   2.046c
BC      1.099d     BC      2.559d     BC      2.576cd    Log     1.978d
HSS p   1.094d     HSS p   2.554d     HSS p   2.566d     t       1.900e
HSS e   1.093d     HSS e   2.553d     HSS e   2.565d     JETS    1.860f

Tab. 5.4: % of times correct local level model selected, and mean α̂ for AO case

Transform   ID    Log   BC    HSS p   HSS e   JETS   t
% Correct   54%   58%   52%   59%     56%     56%    54%
Mean α̂

We would expect that the AO performance measures would be better than the LS performance measures, because there is a big increase in the LS series that should take a few observations to catch up to. However, we are using a rolling origin method, and the volatility in the series means that an underestimate of α is bad. The average α̂ using a local level model was about 0.12 for AO series, compared with 0.48 for LS series. Since the actual α is 0.2, this means that the forecasts for the AO were not being adjusted quickly enough. That the correct local level model is only selected just over half the time is also not good for prediction performance.

The reason we get a low α̂ for the AO case is the sharp up-and-down movement in the series. A high α̂ would mean two big misses, while a low α̂ means there is only one big miss. Optimisation routines will thus select a low α̂ to try to maximise the log likelihood. The JETS and t distributions perform well on the MLPL density measure, but badly on the other three measures. This is because the MLPL measures how well the possible outcomes of a distribution are being covered, while the other measures are more focused on the mean. The JETS distribution's slow adjustment to new data, combined with the underestimate of α, adversely affects its performance on MASE, MAPE and RPS. The t distribution tends to have a low degrees of freedom, meaning that the tails are longer. A percentage histogram of α̂ for the local level model shows that a zero α̂ is selected by the t transform about 33% of the time, while the identity transform only selects a zero α̂ 27% of the time. Longer tails cope better with the one-period outlier, but the higher proportion of zero α̂ values makes the t transform slow to adjust to new observations.

5.4 TC Results

Table 5.5 shows the grouping information for the TC case, and Table 5.6 gives the mean α̂ estimate and percentage of times the correct local level model is selected for all transforms. The AIC selects the local level model about 72% of the time, making the TC case approximately between the AO and LS cases.

Tab. 5.5: Log TC Results by Grouping

Log MASE           Log MAPE           Log RPS            Log MLPL
Trans   Mean       Trans   Mean       Trans   Mean       Trans   Mean
Log     0.947a     Log     2.407a     Log     2.384a     ID      1.804a
ID      0.931ab    ID      2.391ab    ID      2.380a     BC      1.738b
JETS    0.921b     JETS    2.381b     JETS    2.379a     Log     1.731b
BC      0.819c     BC      2.280c     BC      2.253b     HSS p   1.728b
HSS p   0.814c     HSS p   2.274c     HSS p   2.248b     HSS e   1.727b
HSS e   0.810c     HSS e   2.270c     HSS e   2.244b     JETS    1.661c
t       0.810c     t       2.270c     t       2.239b     t       1.623d

Tab. 5.6: % of times correct local level model selected, and mean α̂ for TC case

Transform   ID    Log   BC    HSS p   HSS e   JETS   t
% Correct   78%   76%   70%   73%     70%     70%    72%
Mean α̂

The ID and log transforms perform significantly worse than other transforms for all performance measures when the TC case is used, as these transforms underestimate the true value of α.

The t transform is clearly the best for the TC case because the mean α̂ estimate for the t transform was 0.20, which is the true value to within rounding error. The t distribution will continue to have long tails and delivers better point forecasts, so it outperforms the JETS distribution on MLPL.

5.5 Overall Results

Rather than analysing each type of simulation separately, in this section we combined the three simulation types, and performed our analysis using the combined results. This was done so our study would be closer to real life, where it is not known whether an AO, LS or TC jump type has occurred. Owing to computational limitations, Table 5.7 uses a total of 3000 series, 1000 from each simulation type. Table 5.8 uses all 6000 series.

Tab. 5.7: Log All Simulation Type Results by Grouping

Log MASE           Log MAPE           Log RPS            Log MLPL
Trans   Mean       Trans   Mean       Trans   Mean       Trans   Mean
JETS    0.847a     JETS    2.307a     JETS    2.310a     ID      1.685a
Log     0.698b     Log     2.159b     Log     2.154b     BC      1.658b
ID      0.674c     ID      2.136c     ID      2.126c     HSS p   1.656b
t       0.639d     t       2.101d     HSS p   2.095d     HSS e   1.655b
HSS p   0.639d     HSS p   2.100d     HSS e   2.092d     JETS    1.635c
HSS e   0.635d     HSS e   2.096d     t       2.086d     Log     1.635c
BC      0.634d     BC      2.096d     BC      2.082d     t       1.565d

Tab. 5.8: % of times correct local level model selected, and mean α̂ for all cases

Transform   ID    Log   BC    HSS p   HSS e   JETS   t
% Correct   75%   75%   69%   71%     72%     72%    73%
Mean α̂

The MASE, MAPE and RPS performance measures all have the JETS transform performing significantly worse than any other transform. The log transform is significantly worse than ID, which is then significantly worse than the other four transforms. For the MLPL measure, ID is significantly worse than any other transform, and t significantly better than any other transform. JETS does much better on MLPL than on the other performance measures. As a result, the t transform would be recommended for this simulation study.

5.6 A General Overview of the HSS Results

For the M3 data, the HSS proportional c transform was never significantly worse than any other transform. The picture is somewhat different for these simulated series. The HSS proportional c and HSS estimation transforms are very close on all performance measures. These simulated series did not include very small observations, so the value of c in the HSS formulation did not matter very much.

For the LS case, the HSS transforms were significantly worse than the t transform for all performance measures. For the AO case, HSS transforms are best in point terms on MASE, MAPE and RPS, but significantly worse than any of log, t and JETS for MLPL. For the TC case, the HSS transforms are not significantly worse than the best transform (t) on MASE, MAPE and RPS, but they are significantly worse than both t and JETS on MLPL.

For all simulation types, the distribution of the HSS parameter λ was highly polarised at either zero or one. For the LS type, over 60% of local level λ̂ values are at 1, and just under 30% at 0. The distribution of λ̂ between 0 and 1 is more balanced for the other two cases, with 50% at 1 and over 40% at 0. A λ = 1 is effectively a multiplicative error model, but this relative error model is unsuited to the simulated LS case. An explanation for this is that when λ = 1, the relative errors are inversely proportional to µ_t. Since the LS case increases the value of the series, the smaller (pre-jump) µ_t will produce larger relative errors after the LS jump point. The mean α̂ for the HSS estimation approach is 0.49 when λ is not near 1, but 0.47 when λ is near 1, meaning that the relative error model is not approaching the post-jump series as fast as the identity transform.

Referring back to Equation (3.2.2), the expression for the log likelihood of the multiplicative error model is:

\log L = -\frac{n}{2}\left(1 + \log(2\pi) + 2\log(s)\right) - \sum \log(\mu_t). \qquad (5.6.1)

Because the simulated series are from an additive error model, multiplicative error models would be expected to underperform additive error models. The M3 results for the HSS parameter λ̂ had about 67% at either zero or one, while the simulation results have over 90% at these two extremes.

5.7 Some Additional Randomisation of the Simulations

The previous sections used the fixed parameters l_0 = 100, α = 0.2, σ = 5, p = 5σ = 25 and δ = 0.5 in the TC case. In this section, we randomise p and δ for each series, such that p ~ U(2, 8)σ and δ ~ U(0.1, 0.9). A U(2, 8)σ distribution for p was selected so that the mean would still be 5σ, with the minimum of 2σ chosen so there would still be some discernible jump. Similar logic applies for the distribution of δ; it was chosen so that the expected value of δ was 0.5.

Tables 5.9 to 5.16 show grouping information and other statistics for the randomised simulations.

Tab. 5.9: Log LS Random Results by Grouping

Log MASE           Log MAPE           Log RPS            Log MLPL
Trans   Mean       Trans   Mean       Trans   Mean       Trans   Mean
JETS    0.675a     JETS    2.138a     JETS    2.106a     JETS    1.507a
BC      0.418b     BC      1.884b     Log     1.855b     ID      1.378b
Log     0.415b     Log     1.881bc    HSS e   1.846bc    HSS e   1.377b
HSS e   0.410bc    HSS e   1.877bc    HSS p   1.843bcd   HSS p   1.377b
HSS p   0.409bc    HSS p   1.876bc    BC      1.840bcd   BC      1.374bc
ID      0.396bc    ID      1.862bc    ID      1.817cd    Log     1.359cd
t       0.384c     t       1.851c     t       1.813d     t       1.353d

Tab. 5.10: % of times correct local level model selected, and mean α̂ for LS random case

Transform   ID    Log   BC    HSS p   HSS e   JETS   t
% Correct   86%   87%   80%   78%     85%     85%    84%
Mean α̂

For the LS and AO cases, there is a reduction in the significance of differences between the transforms for the random parameters as opposed to the fixed parameters. However, the ordering of the various transforms is similar. For the LS results, both the estimated α and the % of times the local level model is correctly selected are down somewhat on the fixed parameter results. For the AO results, there is a slight increase in the % correct, while α̂ is unchanged from the fixed parameter results.

Tab. 5.11: Log AO Random Results by Grouping

Log MASE           Log MAPE           Log RPS            Log MLPL
Trans   Mean       Trans   Mean       Trans   Mean       Trans   Mean
JETS    1.187a     JETS    2.648a     JETS    2.690a     ID      2.092a
t       1.143b     t       2.604b     t       2.618b     HSS p   2.075ab
Log     1.126c     Log     2.587c     Log     2.597c     HSS e   2.072b
ID      1.105d     ID      2.566d     ID      2.583cd    BC      2.067b
BC      1.103de    BC      2.563de    BC      2.578de    Log     1.995c
HSS p   1.092de    HSS p   2.553de    HSS p   2.566e     t       1.934d
HSS e   1.091e     HSS e   2.552e     HSS e   2.565e     JETS    1.922d

Tab. 5.12: % of times correct local level model selected, and mean α̂ for AO random case

Transform   ID    Log   BC    HSS p   HSS e   JETS   t
% Correct   57%   60%   56%   61%     53%     53%    57%
Mean α̂

Tab. 5.13: Log TC Random Results by Grouping

Log MASE           Log MAPE           Log RPS            Log MLPL
Trans   Mean       Trans   Mean       Trans   Mean       Trans   Mean
JETS    0.932a     JETS    2.392a     JETS    2.396a     ID      1.771a
Log     0.826b     Log     2.288b     t       2.269b     HSS p   1.768a
t       0.823b     t       2.285b     Log     2.264b     HSS e   1.765ab
HSS p   0.81bc     HSS p   2.272bc    HSS p   2.254bc    BC      1.750bc
HSS e   0.81bc     HSS e   2.272bc    HSS e   2.254bc    JETS    1.735c
ID      0.798c     ID      2.261c     ID      2.242c     Log     1.699d
BC      0.798c     BC      2.260c     BC      2.240c     t       1.699d

Tab. 5.14: % of times correct local level model selected, and mean α̂ for TC random case

Transform   ID    Log   BC    HSS p   HSS e   JETS   t
% Correct   67%   69%   62%   69%     62%     62%    67%
Mean α̂

Tab. 5.15: Log All Simulation Type Random Results by Grouping

Log MASE           Log MAPE           Log RPS            Log MLPL
Trans   Mean       Trans   Mean       Trans   Mean       Trans   Mean
JETS    0.919a     JETS    2.381a     JETS    2.382a     ID      1.733a
Log     0.775b     Log     2.238b     Log     2.223b     HSS p   1.728ab
t       0.765bc    t       2.227bc    t       2.213bc    HSS e   1.726ab
BC      0.764bc    BC      2.227bc    BC      2.209bc    BC      1.718b
HSS e   0.758bc    HSS e   2.221bc    HSS e   2.207bc    JETS    1.717b
HSS p   0.758bc    HSS p   2.221bc    HSS p   2.207bc    Log     1.674c
ID      0.755c     ID      2.218c     ID      2.200c     t       1.649d

Tab. 5.16: % of times correct local level model selected, and mean α̂ for all random outliers

Transform   ID    Log   BC    HSS p   HSS e   JETS   t
% Correct   70%   72%   66%   69%     67%     67%    69%
Mean α̂

The TC results show more changes compared with the fixed results. In the TC case, from period 58, A_t = δ^{t-58}. If δ = 0.5, then one period after the jump point the average of y will be 0.5p above the old level, and 0.25p above at t = 60. However, if δ = 0.9, the average of y will be 0.9p above at t = 59 and 0.81p above at t = 60. The U(0.1, 0.9) distribution only applies directly at t = 59, and after that there will be a greater concentration of points near the level prior to the jump, with a few points still well above that level.

The Box-Cox and ID transforms do better when using random parameters for the TC simulation type on MASE, MAPE and RPS, while the t transform performs significantly worse compared to the ID and Box-Cox transforms. The average α̂ values for both ID and Box-Cox increased to 0.24, and, for these simulations, it is better to slightly overestimate α than to slightly underestimate it. Due to the major differences in the random TC parameter results, the overall results also showed changes. Now ID is the best transform on the MASE, MAPE and RPS measures, though it is only significantly better than JETS and log on these measures. However, the MLPL measure still has the ID as the worst transform, and it is significantly worse on this measure than any of the Box-Cox, JETS, log or t transforms. The t transform is still significantly better than any other transform on the MLPL measure.

5.8 Conclusions

In this chapter, we have used simulations to investigate the performance of the transforms. We have found that the JETS transform performs poorly on MASE, MAPE and RPS measures for all simulation types.

When all simulation types are combined, the identity transform is significantly worse on the MLPL measure than all other transforms on fixed parameters, and significantly worse than all but two other transforms on random parameters. However, the identity transform is the best of all transforms in point terms for MASE, MAPE and RPS with random parameters. The HSS transforms do not perform as well in this simulation study as they did on the M3 data. This happens because the HSS transform allows for multiplicative error models, but all simulated series used additive error models. As the HSS transform is the best transform for the real data, we still recommend it. While the simulation study in this chapter offers some additional insight into why transforms perform as they do, real data is still the ultimate test of performance.

6. NON-NEGATIVE DISTRIBUTIONS FOR TIME SERIES

6.1 Introduction

The basic form for an additive state space model, from Equations (2.5.1) to (2.5.3), is:

e_t = y_t - w' x_{t-1};
x_t = F x_{t-1} + g e_t. \qquad (6.1.1)

We select our models from the six additive models in Table 2.1; these models are:

1. Local level
2. Local trend
3. Damped trend

4. Seasonal with local level
5. Seasonal with local trend
6. Seasonal with damped trend

Only the first three models can be applied to non-seasonal data. As we saw in Chapter 2, the sample log likelihood for the normal distribution can be estimated given the standard deviation of the errors σ̂, and is given by (2.6.2), where n is the series length:

\log L_N = -\frac{n}{2}\left(1 + \log(2\pi) + 2\log(\hat{\sigma})\right).

A problem with the normal distribution is that most economic time series are always positive, but the normal distribution is unbounded, and so there will be at least a small probability of getting a negative observation according to the model. For some series, there is a downward trend near the end of the series, and this can then lead to negative point predictions. It is nonsensical to have a non-negative time series, like the number of cars sold, with negative predictions, so this problem should be avoided. This is why three non-negative distributions are suggested in this chapter: the gamma distribution, the lognormal distribution and the truncated normal distribution.

We have two separate issues here: the fits and the forecasts; non-negative distributions can be applied to just the forecasts, or to both the fits and the forecasts. When applied to just the forecasts, the normal distribution is used for the fits. We applied both methods to the M3 data in Sections 6.2 to 6.4; results are tabulated in Section 6.5.

6.2 The Gamma Distribution

6.2.1 The Gamma Distribution Formulation

The gamma distribution is used in the context of time series, for example in Chapter 15 of Hyndman et al. (2008a). We define the gamma probability density function (pdf) as:

f(x \mid a, b) = \frac{1}{b^a \Gamma(a)}\, x^{a-1} \exp\left(-\frac{x}{b}\right), \quad x > 0. \qquad (6.2.1)

Here a > 0 and b > 0 are shape and scale parameters, respectively, and x > 0 is the variable of interest. There are other parameterizations of the gamma distribution; the one used here is that specified in the Matlab Manual (2010), and is convenient because Matlab was used for most of the computer work in this thesis.

Using the Matlab parameterization, the mean of the gamma distribution is given by ab, and its variance by ab². When Equation (6.1.1) is applied to a time series, we will obtain the predictions and errors for the fitting sample. We then need to convert these into shape and scale parameter estimates by solving simultaneous equations so that the gamma log likelihood can be estimated; the shape and scale estimates can be calculated from the predictions, µ̂_t, and error standard deviation, σ̂, as:

\hat{b}_t = \hat{\sigma}^2/\hat{\mu}_t; \quad \hat{a}_t = \hat{\mu}_t/\hat{b}_t = \hat{\mu}_t^2/\hat{\sigma}^2. \qquad (6.2.2)

The estimates â_t and b̂_t can now be used to calculate the gamma log likelihood for observations y_t; this is given by:

\log L_G = \sum_{t=1}^{n}\left(-\hat{a}_t \log(\hat{b}_t) - \log(\Gamma(\hat{a}_t)) + (\hat{a}_t - 1)\log(y_t) - y_t/\hat{b}_t\right). \qquad (6.2.3)

The gamma log likelihood uses method of moments estimates of a_t and b_t because (6.1.1) gives us predictions and errors, and does not tell us how new values of â_t and b̂_t can be calculated as we progress through the time series.

This is different to what would apply if we were fitting a gamma distribution to time-independent observations. The Matlab function gammaln can be used to calculate log(Γ(â_t)) without experiencing computational problems due to very large numbers. For example, Γ(200) gives an Inf output in Matlab, but gammaln(200) gives approximately 857.9.

6.2.2 Gamma Prediction Distributions

Simulation needs to be used to obtain distributional statistics for the gamma distribution, as we do not know how the smoothing and damping parameters will affect the gamma density for more than one period into the future. The following form uses (6.2.2) to convert µ_t and σ into gamma parameters a and b. This is used for t = n + 1, n + 2, ..., n + h, where h is the number of forecasts required:

y_t = \mathrm{randgamma}(\mu_{t-1}, \sigma) \quad (\mu\ \text{and}\ \sigma\ \text{converted to}\ a\ \text{and}\ b\ \text{first});
e_t = y_t - \mu_{t-1};
x_t = F x_{t-1} + g e_t; \qquad (6.2.4)
\mu_t = w' x_t.

This simulation is repeated R times, so that an R × h matrix of y_t values is obtained. Averaging over each column will give the point predictions µ̂_t for all future time periods. The σ_t can be estimated by taking the standard deviations of all columns. The methods described in Chapters 3 and 4 can now be used to obtain prediction measures such as MASE, MAPE, RPS and MLPL. A rolling origin is used. If the coefficient of variation (CV), defined as CV_t = σ̂_t/µ̂_t, is small, then the gamma parameter estimate â_t will be large, and this can lead to a problem with calculating Γ(â_t) in the gamma distribution function. For this reason, for any case with CV_t < 0.1, the gamma distribution was replaced with the normal distribution for that particular simulation. As a_t → ∞, the gamma distribution with this parameterization tends to the N(ab, ab²) distribution.
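The moment matching of (6.2.2) and the forecast simulation of (6.2.4) can be sketched as follows (an illustrative Python version; the function names, and the state space matrices F, g and w passed in, are assumptions rather than the thesis's code):

import numpy as np

def gamma_params(mu, sigma):
    """Method-of-moments conversion (6.2.2): mean = a*b, variance = a*b^2."""
    b = sigma ** 2 / mu
    a = mu / b
    return a, b

def simulate_gamma_paths(x_n, mu_next, sigma, F, g, w, h, R=1000, rng=None):
    """Simulate R sample paths of length h following the recursion in (6.2.4)."""
    rng = np.random.default_rng() if rng is None else rng
    paths = np.zeros((R, h))
    for r in range(R):
        x, mu = x_n.copy(), mu_next
        for j in range(h):
            if mu <= 0:                       # negative mean: simulated value set to zero
                y = 0.0
            elif sigma / mu < 0.1:            # small CV: gamma is close to normal
                y = rng.normal(mu, sigma)
            else:
                a, b = gamma_params(mu, sigma)
                y = rng.gamma(a, b)
            e = y - mu
            x = F @ x + g * e
            mu = float(w @ x)
            paths[r, j] = y
    return paths                               # column means/stds give point forecasts and sigma_t

This is only a sketch of the mechanism: column means of the returned matrix supply the point predictions, and column standard deviations supply the σ_t estimates described above.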

This simulation method does not solve the problem of what happens if a steep downward trend is observed. It is still possible for a prediction of a negative value to arise. This is not a legal parameter choice under the gamma distribution. In the event it occurs, the simulated y_t are set to 0; this is similar to the converge-to-zero problem that was observed for count data (Snyder et al., 2012) because, once the simulated values are at zero, they will remain at zero.

6.3 The Lognormal Distribution

The lognormal distribution is used for time series in Akram, Hyndman and Ord (2007). X is said to be lognormal if log X ~ N(µ, σ²); this means that the log of X is normally distributed. Because X is the exponential of a normal random variable, negative x cannot occur. If m represents the mean of X and V its variance, m and V can be calculated from µ and σ using Watson (1983, pg 423):

m = \exp(\mu + \sigma^2/2); \qquad (6.3.1)
V = \exp(2\mu + \sigma^2)\left[\exp(\sigma^2) - 1\right]. \qquad (6.3.2)

The lognormal distribution has a pdf in the original space of:

f(x \mid \mu, \sigma^2) = \frac{1}{x\sigma\sqrt{2\pi}} \exp\left(-\frac{(\log(x) - \mu)^2}{2\sigma^2}\right), \quad x > 0. \qquad (6.3.3)

By putting ε_t = log(x_t) - µ_t, so that ε_t ~ N(0, σ²), the log likelihood for the lognormal distribution can be derived using the same methods as in Chapter 2 as:

\log L = -\left(\sum \log(y_t) + n\log(\hat{\sigma}\sqrt{2\pi}) + \frac{n}{2}\right). \qquad (6.3.4)

The best way to apply the lognormal distribution to time series is to log the actual y, the method used in Chapter 4.
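A small helper for the moment conversions (6.3.1) and (6.3.2) (illustrative only; the function name is ours):

import numpy as np

def lognormal_moments(mu, sigma):
    """Mean and variance of X when log X ~ N(mu, sigma^2), as in (6.3.1)-(6.3.2)."""
    m = np.exp(mu + sigma ** 2 / 2)
    V = np.exp(2 * mu + sigma ** 2) * (np.exp(sigma ** 2) - 1)
    return m, V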

6.4 The Truncated Normal Distribution

6.4.1 The Basic Idea

The truncated normal distribution cuts off the normal distribution at 0, thus ensuring there can be no negative values. As a result, the pdf of the truncated normal is the pdf of the full normal, conditional on x > 0. Figure 6.1 has some graphs of truncated normal distributions. The numbers in the legend are respectively the mean and standard deviation of that particular normal distribution when truncation is not used. Clearly, the means of all these distributions will be greater than their means without truncation.

If we use TN to represent the truncated normal and N to represent the full normal, we can calculate the truncated normal mean and variance given the corresponding full normal mean and variance using the following formulae, derived from Chapter 22 of Greene (2003):

\eta = -\mu_N/\sigma_N;
\lambda(\eta) = \frac{\phi(\eta)}{1 - \Phi(\eta)};
\delta(\eta) = \lambda(\eta)\left[\lambda(\eta) - \eta\right]; \qquad (6.4.1)
\mu_{TN} = \mu_N + \sigma_N \lambda(\eta);
\sigma^2_{TN} = \sigma^2_N\left[1 - \delta(\eta)\right].

Here, φ(η) and Φ(η) are respectively the probability density and cumulative distribution functions of the standard normal distribution at the point η. The pdf of a non-negative observation x is given by (Johnson, Kotz & Balakrishnan, 1994; Greene, 2003, Chap. 22):

Fig. 6.1: Pdfs of truncated normal random variables; the truncation occurs at zero. Legend numbers show µ and σ of the normal distribution when truncation is NOT used

W = 1 - \Phi(\hat{\eta});
z = \frac{x - \hat{\mu}_N}{\hat{\sigma}_N};
f(x; \mu_N, \sigma_N) = \frac{\phi(z)}{\hat{\sigma}_N W}; \qquad (6.4.2)
\log f = -\frac{1}{2}\left(\log(2\pi) + z^2\right) - \log(\hat{\sigma}_N) - \log(W).

Simulated values can be generated using the quantile method. R probabilities are evenly spaced between 1 - W and 1, excluding endpoints. The normal inverse function can now be used to generate the R x-values, using a normal inverse with parameters µ_N and σ_N. We used R = 1000.
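The conversion (6.4.1), the log density (6.4.2) and the quantile simulation can be sketched as follows (illustrative Python with assumed names; the exact quantile spacing follows the description above only approximately):

import numpy as np
from scipy.stats import norm

def truncnorm_moments(mu_N, sigma_N):
    """Mean and variance of N(mu_N, sigma_N^2) truncated at zero, as in (6.4.1)."""
    eta = -mu_N / sigma_N
    lam = norm.pdf(eta) / (1.0 - norm.cdf(eta))
    delta = lam * (lam - eta)
    return mu_N + sigma_N * lam, sigma_N ** 2 * (1.0 - delta)

def truncnorm_quantile_sample(mu_N, sigma_N, R=1000):
    """Quantile-based simulated values from the zero-truncated normal."""
    p0 = norm.cdf(0.0, loc=mu_N, scale=sigma_N)             # probability mass below zero
    probs = p0 + (1.0 - p0) * (np.arange(1, R + 1) - 0.5) / R
    return norm.ppf(probs, loc=mu_N, scale=sigma_N)

As a quick sanity check, with mu_N = 0 the first function returns a mean of about 0.80*sigma_N and a variance of about 0.36*sigma_N**2, which are the known half-normal values.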
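6.4.2 Application to Time Series Data

Applying the truncated normal (TN) distribution to time series was the author's idea. The following procedure was used; the TN distribution in this procedure was applied only to the forecasts.

1. Use the normal distribution to estimate the fits and σ for the fitting sample. Since the observations are all positive, the fits are almost certain to be all positive.

2. Use the algorithms in Section 2.7 to project the future point predictions and variances from the normal distribution.

3. Convert these normal point predictions and variances to truncated normal point predictions and variances. Simulate the prediction distributions to calculate the RPS. Calculate the truncated normal probabilities of actual future observations for the MLPL.

If applying the TN distribution to the fits as well, step 1 should be replaced with a TN distribution based on the fits and σ̂ calculated from the normal distribution.

Unfortunately, one annual series had a very sudden drop, which leads to very big negative projections for the future time periods for a local trend model. The standardised η̂ for this series was about 20, and Matlab does not calculate W = 1 - Φ(η̂) accurately when Φ(η̂) is very close to one. Effectively, for this series, W was computed as 1 - 1 = 0 to computational precision. The best way to circumvent this problem is to use Φ(-η) instead of 1 - Φ(η) in both Equations (6.4.1) and (6.4.2).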

Because the normal distribution is symmetrical, Φ(-η) = 1 - Φ(η). Using the left tail means that the computational problems that occur when Φ(η) is very close to 1 no longer occur, as Φ(-η) can be represented as 10 to a large negative power. Because φ(-η) will also be very small for large η, this means that

\frac{\phi(\eta)}{1 - \Phi(\eta)} = \frac{\phi(-\eta)}{\Phi(-\eta)}

can now be calculated for a much larger value of η. Thus, the normal distribution can now be used for much larger values of η.
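The point is easy to verify numerically (illustrative snippet; scipy.stats.norm is assumed as the normal cdf routine):

from scipy.stats import norm

eta = 20.0
print(1.0 - norm.cdf(eta))   # 0.0: the upper-tail subtraction has been rounded away
print(norm.cdf(-eta))        # a tiny but representable number (of order 1e-89)
print(norm.logcdf(-eta))     # log of the tail probability, usable directly in a log likelihood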

6.5 Results

Results for the M3 data are presented in Table 6.1 using the normal distribution, gamma distribution, truncated normal distribution and lognormal distribution. These results are all for time series without transforms. The results have been calculated using the same methods as in Chapters 3 and 4. The best truncations on each prediction measure are bolded. Here are the truncation codes used in Table 6.1. The lognormal distribution is only applied to the forecasts because the log transform was applied to the fits in Chapter 4.

Norm - full normal distribution
G fit - gamma distribution applied to fits as well as forecasts
G NF - gamma distribution applied just to forecasts
LN - lognormal distribution applied just to forecasts
TN fit - truncated normal distribution applied to fits as well as forecasts
TN NF - truncated normal distribution applied just to forecasts

Tab. 6.1: Averages of Prediction Measures (rows: MASE, MAPE, RPS, MLPL; columns: Norm, G fit, G NF, LN, TN fit, TN NF; given separately for the Annual (645 series), Quarterly (756 series) and Monthly (1428 series) data)

The gamma fit distribution looks best for annual data, with the exception of the MLPL. On quarterly data, the normal and truncated normal distributions do about equally well, with the gamma a long way behind.

With the exception of the MAPE, the truncated normal distribution does well on monthly data. The same methods as in Chapters 4 and 5 were used to analyse the logged truncations. For the truncation factor, Table 6.2 was obtained. Best truncations for each prediction measure are bolded.

Tab. 6.2: Averages of Logs of Prediction Measures (rows: MASE, MAPE, RPS, MLPL; columns: Norm, G fit, G NF, LN, TN fit, TN NF; given separately for the Log Annual (645 series), Log Quarterly (756 series) and Log Monthly (1428 series) data)

The log transform has decreased the relative performance of the gamma fit distribution, so that it is no longer the best method for annual data.

ANOVA was used to analyse the logged truncation data, and pairwise tests were performed using Tukey's multiple comparison method. Grouping tables for each frequency are given in Tables 6.3, 6.4 and 6.5. Findings for each frequency are then summarised.

Tab. 6.3: Grouping Results for Annual Data

Log MASE            Log MAPE            Log RPS             Log MLPL
Dist     Mean       Dist     Mean       Dist     Mean       Dist     Mean
TN fit   0.699a     TN fit   2.440a     TN fit   6.223a     TN fit   2.268a
Norm     0.687a     Norm     2.427a     G fit    6.217a     G fit    2.256ab
G fit    0.686a     G fit    2.427a     G NF     6.215a     LN       2.243abc
G NF     0.686a     G NF     2.426a     LN       6.214a     G NF     2.234bc
LN       0.686a     LN       2.425a     Norm     6.210a     Norm     2.224c
TN NF    0.679a     TN NF    2.420a     TN NF    6.201a     TN NF    2.224c

Tab. 6.4: Grouping Results for Quarterly Data

Log MASE            Log MAPE            Log RPS             Log MLPL
Dist     Mean       Dist     Mean       Dist     Mean       Dist     Mean
G fit    0.419a     G fit    1.738a     G fit    5.457a     G fit    1.994a
TN fit   0.405a     TN fit   1.725a     TN fit   5.440ab    LN       1.991ab
Norm     0.385b     Norm     1.703b     LN       5.429b     TN fit   1.989ab
LN       0.384b     LN       1.703b     G NF     5.429b     G NF     1.986ab
G NF     0.384b     TN NF    1.703b     Norm     5.426b     Norm     1.981b
TN NF    0.384b     G NF     1.702b     TN NF    5.424b     TN NF    1.981b

Annual Data

Only the MLPL shows significant differences. In this case, the TN no fit and normal are very close to each other, and are significantly better than either the TN fit or gamma fit. Gamma no fit and lognormal are not significantly worse than TN no fit or normal.

Tab. 6.5: Grouping Results for Monthly Data

Log MASE            Log MAPE            Log RPS             Log MLPL
Dist     Mean       Dist     Mean       Dist     Mean       Dist     Mean
TN NF    0.304a     TN NF    2.199a     G NF     5.709a     LN       1.977a
Norm     0.304a     Norm     2.194a     Norm     5.708ab    G NF     1.975ab
TN fit   0.304a     TN fit   2.193a     G fit    5.706ab    G fit    1.975ab
G fit    0.301a     G fit    2.191a     LN       5.704ab    Norm     1.973bc
G NF     0.299a     G NF     2.189a     TN fit   5.698ab    TN fit   1.973bc
LN       0.297a     LN       2.187a     TN NF    5.696b     TN NF    1.971c

Quarterly Data

All measures have significant differences in ANOVA. However, TN no fit, lognormal and normal were always in the better group, and gamma fit in the worse group. Using means, it appears that TN no fit is slightly better than normal, which is in turn slightly better than gamma no fit, although gamma no fit is better than normal for the MASE, and is the best method for the MAPE.

Monthly Data

Only RPS and MLPL had any significant differences. For MLPL, TN no fit is significantly better than lognormal or either gamma option, but not significantly better than normal. For RPS, there is only one significant difference, with TN no fit significantly better than gamma no fit.

However, the difference between normal and TN no fit is close to significant (T = 2.7, P = 0.07).

6.5.1 Restriction to series where truncation is necessary

The full normal distribution performs well on the overall M3 data because there are relatively few series where truncation is required. For this subsection, we restricted our dataset to the 20 total series (three annual, one quarterly and 16 monthly) where predictions from a normal distribution, taken at the end of the fitting sample for the model selected by the AIC for each series, gave at least one negative point forecast. Residual plots for this 20-series subset for the log MASE, log MAPE and log RPS appeared to be consistent with normality, but the residual plots for log MLPL were very non-normal, so we omitted log MLPL from the analysis. We obtained Table 6.6 for the other three performance measures. Table 6.6 shows that when the normal distribution gives a negative point prediction, the use of a non-negative distribution significantly improves the forecasting performance. While the normal distribution does better on the overall results, this table implies that, on the rare occasions where we get a negative point prediction on a positive-definite time series, using a non-negative distribution will improve the forecasts for that time series.

Tab. 6.6: Grouping Results when restricted to 20 total series with at least one negative prediction on normal distribution

Log MASE            Log MAPE            Log RPS
Dist     Mean       Dist     Mean       Dist     Mean
Norm     0.072a     Norm     4.285a     G NF     6.561a
G NF     0.162ab    G NF     4.017ab    Norm     6.519a
LN       0.332bc    LN       3.816bc    LN       6.339ab
TN fit   0.575c     TN fit   3.628c     G fit    6.137bc
G fit    0.621c     G fit    3.539c     TN fit   5.988c
TN NF    0.636c     TN NF    3.522c     TN NF    5.951c

6.6 Conclusions

While the difference is not always significant, the truncated normal applied only to the forecasts is nearly always better than one applied to the fits as well as the forecasts. An explanation is that, like the gamma distribution, the truncated normal will tend to have fits that are greater than the least squares fits if applied to the fitting sample. This is carried on into the future periods, and makes the truncated normal worse. Using the full normal distribution for the fits means that they are least squares fits, and the truncated normal can then be applied in future periods without as much positive skew.

The Gaussian distribution is never significantly worse in its performance than a truncation method. As a result, either the full normal or the truncated normal applied only to future observations is recommended. In the transform chapters, methods other than no transform were found to significantly improve forecasts, but the non-negative distributions described here do not significantly improve overall forecasts over the normal distribution. There are a few series that gave negative predictions using the full normal distribution, where non-negative distributions significantly improve forecasts.

7. TRANSFORMS COMBINED WITH TRUNCATIONS

We have considered transforms of time series in Chapter 4, and truncations of time series in Chapter 6. In this chapter, adjustments to the transforms will be considered so that the real space y_t is always positive. This chapter thus relates to using truncated transforms, which are described in Section 7.1. Results for these truncated transforms, obtained using ANOVA, are given in Section 7.2.

7.1 The Truncated Transforms

This section describes how we amend the transforms to ensure that the resulting real space y_t is always positive. All transforms described in Chapter 4 are considered, using the truncated normal distribution where a truncation is appropriate.

7.1.1 Log Transform

If y*_t is the transformed space value, then the conversion is done using y_t = exp(y*_t). Since this is positive for all real y*, no truncation is necessary.

7.1.2 Box-Cox Transform

The inverse transform from y*_t to y_t is:

y_t = (\lambda y^*_t + 1)^{1/\lambda}, \quad 0 < \lambda < 1. \qquad (7.1.1)

We can ensure that y_t is always positive by setting y*_t > -1/λ. Therefore, the truncation should be at -1/λ instead of 0. Alternatively, adding 1/λ to the mean and truncating at 0 will also work. There are three specifications of the Box-Cox parameter λ that have to be treated differently.

1. λ = 1 (no transform): We use the usual truncated normal here.

2. λ = 0 (log transform): No truncation is necessary.

3. 0 < λ < 1: Here, we use the truncated normal on the transformed y*_t, setting y*_t > -1/λ. Then, the y*_t need to be transformed back into the real space y_t using the inverse Box-Cox transform.

7.1.3 HSS Transform

The distribution of y_t can be expressed as:

y_t \sim N\!\left(\mu_t, (c + \mu_t)^{2\lambda}\sigma^2\right). \qquad (7.1.2)

This can be truncated so that y_t > 0 using a truncated normal distribution.

7.1.4 JETS Transform

From Section 4.3.1, we have the following formulation for the Johnson Error Trend Seasonal (JETS) transform:

\mu_t = w' x_{t-1};
e_t = y_t - \mu_t;
\varepsilon_t = \frac{1}{\kappa}\sinh^{-1}(\kappa e_t); \qquad (7.1.3)
x_t = F x_{t-1} + g\varepsilon_t;
\varepsilon_t \sim N(0, \sigma^2).

For truncation, we want y_t = µ_t + e_t > 0, so that e_t > -µ_t. This means that we need ε_t > (1/κ) sinh^{-1}(-κµ_t). Since ε_t is normal, this is where we truncate our distribution to ensure that all y_t > 0.

To find the truncated Johnson pdf, we assume that the random variable E has a Johnson distribution, where E represents e_t in (7.1.3). We need to find the pdf of E conditional on E being greater than -µ. From Section 4.3.1, the pdf of the Johnson distribution is, with φ(·) the standard normal pdf:

f(e) = \frac{1}{\sigma\sqrt{1 + (\kappa e)^2}}\, \phi\!\left(\frac{\sinh^{-1}(\kappa e)}{\kappa\sigma}\right). \qquad (7.1.4)

If we put u = \frac{1}{\kappa\sigma}\sinh^{-1}(\kappa e), then du/de = \frac{1}{\sigma\sqrt{1 + (\kappa e)^2}}. Thus the cdf of the Johnson distribution, F(e), is such that F(e) = Φ(u), where Φ(·) is the standard normal cdf.

We calculate the conditional Johnson pdf as:

f(e \mid E > -\mu) = \frac{f(e)}{1 - F(-\mu)}
= \frac{\frac{1}{\sigma\sqrt{1+(\kappa e)^2}}\,\phi\!\left(\frac{\sinh^{-1}(\kappa e)}{\kappa\sigma}\right)}{1 - \Phi\!\left(\frac{\sinh^{-1}(-\kappa\mu)}{\kappa\sigma}\right)}
= \frac{\phi(u)}{\sigma W \sqrt{1 + (\kappa e)^2}}, \qquad (7.1.5)

\text{where } W = 1 - \Phi\!\left(\frac{\sinh^{-1}(-\kappa\mu)}{\kappa\sigma}\right).

This means that the log pdf of the truncated Johnson distribution can be expressed as

\log f(e \mid E > -\mu) = -\frac{1}{2}\left[\log(2\pi) + u^2 + \log(1 + (\kappa e)^2)\right] - \log(\sigma) - \log(W). \qquad (7.1.6)

As with the truncated normal, the symmetry of the Johnson distribution about zero means that 1 - F(-µ) = F(µ); that is, the probability of getting an observation greater than -µ is equal to the probability of an observation less than µ. The latter is easier to calculate with mathematical software.
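A direct transcription of (7.1.6) might look as follows (illustrative only; names are assumed, and the symmetry F(µ) = 1 - F(-µ) noted above is used for W via the normal log-cdf):

import numpy as np
from scipy.stats import norm

def truncated_jets_logpdf(e, mu, kappa, sigma):
    """Log density (7.1.6) of a Johnson error e conditional on e > -mu."""
    u = np.arcsinh(kappa * e) / (kappa * sigma)
    log_W = norm.logcdf(np.arcsinh(kappa * mu) / (kappa * sigma))   # log F(mu) = log(1 - F(-mu))
    return (-0.5 * (np.log(2 * np.pi) + u ** 2 + np.log1p((kappa * e) ** 2))
            - np.log(sigma) - log_W)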

7.1.5 Truncated t Distribution

From Section 4.3.3, the pdf of the t distribution with a continuous degrees of freedom (ν) is defined for all real z as:

f(z) = \frac{1}{\sqrt{\nu\pi}} \frac{\Gamma\left(\frac{\nu+1}{2}\right)}{\Gamma\left(\frac{\nu}{2}\right)} \left(1 + \frac{z^2}{\nu}\right)^{-\frac{\nu+1}{2}}; \qquad (7.1.7)

let F(z) be the corresponding cdf. Let z_t = e_t/σ for a particular time series. Suppose this can be modelled by the t distribution. If we want E > -µ_t, then we condition on Z > -µ/σ. The conditional pdf is:

f\!\left(z \mid Z > -\tfrac{\mu}{\sigma}\right) = \frac{f(z)}{1 - F\!\left(-\tfrac{\mu}{\sigma}\right)} \quad \text{for } z > -\tfrac{\mu}{\sigma};
= \frac{f(z)}{F\!\left(\tfrac{\mu}{\sigma}\right)} \quad \text{by symmetry};
= \frac{\Gamma\left(\frac{\nu+1}{2}\right)}{\Gamma\left(\frac{\nu}{2}\right)\sqrt{\nu\pi}}\, \frac{\left(1 + \frac{z^2}{\nu}\right)^{-\frac{\nu+1}{2}}}{F\!\left(\frac{\mu}{\sigma}\right)}. \qquad (7.1.8)

As the ê_t are initially calculated, we need to divide by σ̂ to get the standardised errors ẑ_t. Thus, from the equations above, the log likelihood for the truncated t distribution is:

\log L = n\log\left(\frac{1}{\sqrt{\nu\pi}} \frac{\Gamma\left(\frac{\nu+1}{2}\right)}{\Gamma\left(\frac{\nu}{2}\right)}\right) - n\log(\hat{\sigma}) - \sum_{t=1}^{n}\left[\frac{\nu+1}{2}\log\left(1 + \frac{\hat{z}_t^2}{\nu}\right) + \log\left(F\!\left(\frac{\hat{\mu}_t}{\hat{\sigma}}\right)\right)\right]. \qquad (7.1.9)
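Extending the continuous-ν t likelihood sketch from Chapter 4, the truncation in (7.1.9) only adds a log-cdf term; a minimal illustration follows (names are assumptions; scipy.stats.t supplies a t cdf that accepts non-integer ν):

import numpy as np
from math import lgamma, log, pi
from scipy.stats import t as student_t

def truncated_t_loglik(errors, mu_hat, nu):
    """Log likelihood (7.1.9): the t likelihood of (4.3.14) minus the truncation terms."""
    e = np.asarray(errors, dtype=float)
    mu = np.asarray(mu_hat, dtype=float)
    n = e.size
    sigma_hat = np.sqrt(np.mean(e ** 2))
    z = e / sigma_hat
    const = lgamma((nu + 1) / 2) - lgamma(nu / 2) - 0.5 * log(nu * pi)
    return (n * const - n * log(sigma_hat)
            - np.sum((nu + 1) / 2 * np.log1p(z ** 2 / nu)
                     + student_t.logcdf(mu / sigma_hat, df=nu)))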

7.2 Results

Using the M3 data, as previously described, the MASE, MAPE, RPS and MLPL for each series and truncation method were ascertained, and these results were then logged. The logged data were then analysed with Minitab. The Fit results from the previous chapter were excluded, as they were generally worse than the no fit results, so only no fit results are included here; this means that the full non-truncated distribution is used for the estimation procedure, and only the predictions use truncation. The method codes are:

Norm - standard normal, no truncation
TN - truncated normal distribution
Gam - gamma distribution
LN - lognormal distribution

BC tr - Box-Cox with a truncated normal distribution
HSS p - truncated HSS with a proportional c approach
HSS e - truncated HSS with an estimation approach
tr J - JETS distribution truncated at 0
tr t - t distribution truncated at 0

Tables 7.1, 7.2 and 7.3 give grouping information from Minitab's ANOVA analysis, where two methods that do not share a common letter are significantly different. Results for each data frequency are then summarised.

Tab. 7.1: Grouping Results for Annual Data

  Log MASE           Log MAPE           Log RPS            Log MLPL
  Dist     Mean      Dist     Mean      Dist     Mean      Dist     Mean
  HSS p    0.699a    HSS p    2.440a    Gam      6.215a    LN       2.243a
  HSS e    0.690a    HSS e    2.432a    LN       6.214a    Gam      2.234a
  Norm     0.687a    Norm     2.427a    Norm     6.210a    Norm     2.224ab
  Gam      0.686a    Gam      2.426a    tr t     6.209a    TN       2.224ab
  LN       0.686a    LN       2.425a    TN       6.201a    BC tr    2.196bc
  tr t     0.684a    tr t     2.425a    tr J     6.194a    tr t     2.183cd
  TN       0.679a    TN       2.420a    HSS p    6.191a    tr J     2.177cd
  tr J     0.675a    tr J     2.418a    HSS e    6.184a    HSS e    2.154de
  BC tr    0.663a    BC tr    2.405a    BC tr    6.183a    HSS p    2.143e

Tab. 7.2: Grouping Results for Quarterly Data

  Log MASE           Log MAPE           Log RPS            Log MLPL
  Dist     Mean      Dist     Mean      Dist     Mean      Dist     Mean
  BC tr    0.408a    BC tr    1.726a    BC tr    5.446a    LN       1.991a
  HSS e    0.402ab   HSS e    1.721ab   HSS p    5.437a    Gam      1.986a
  HSS p    0.400ab   HSS p    1.720ab   HSS e    5.436a    BC tr    1.984a
  tr J     0.385ab   tr J     1.705ab   LN       5.429a    Norm     1.981ab
  Norm     0.385ab   Norm     1.703ab   Gam      5.429a    TN       1.981ab
  LN       0.384ab   LN       1.703ab   Norm     5.426a    HSS e    1.968bc
  Gam      0.384ab   TN       1.703ab   TN       5.424a    tr J     1.968bc
  TN       0.384ab   Gam      1.702ab   tr t     5.423a    tr t     1.964c
  tr t     0.378b    tr t     1.698b    tr J     5.423a    HSS p    1.960c

Tab. 7.3: Grouping Results for Monthly Data

  Log MASE           Log MAPE           Log RPS            Log MLPL
  Dist     Mean      Dist     Mean      Dist     Mean      Dist     Mean
  tr t     0.310a    tr t     2.203a    Gam      5.709a    LN       1.977a
  TN       0.304ab   TN       2.199ab   Norm     5.708a    Gam      1.975a
  Norm     0.304ab   Norm     2.194ab   tr t     5.706ab   Norm     1.973a
  HSS p    0.301ab   HSS p    2.191ab   LN       5.704ab   TN       1.971ab
  Gam      0.299ab   HSS e    2.189ab   TN       5.696abc  HSS e    1.970ab
  HSS e    0.298ab   Gam      2.189ab   HSS p    5.690bc   BC tr    1.965bc
  LN       0.297ab   LN       2.187ab   HSS e    5.685c    HSS p    1.962c
  BC tr    0.296ab   BC tr    2.183b    BC tr    5.684c    tr t     1.960c
  tr J     0.289b    tr J     2.181b    tr J     5.681c    tr J     1.960c
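The grouping letters in Tables 7.1 to 7.3 were produced by Minitab. An analogous, though simplified, pairwise comparison can be run in Python with a one-way Tukey HSD procedure; the sketch below uses synthetic numbers and assumed column names ('method', 'log_mase'), so it illustrates the idea rather than reproducing the thesis analysis.

import numpy as np
import pandas as pd
from statsmodels.stats.multicomp import pairwise_tukeyhsd

# Synthetic stand-in for the logged accuracy results: one row per (series, method).
rng = np.random.default_rng(1)
methods = ["Norm", "TN", "BC tr", "HSS p"]
frames = [pd.DataFrame({"method": m,
                        "log_mase": rng.normal(loc=0.70 - 0.01 * i, scale=0.3, size=100)})
          for i, m in enumerate(methods)]
df = pd.concat(frames, ignore_index=True)

tukey = pairwise_tukeyhsd(endog=df["log_mase"], groups=df["method"], alpha=0.05)
print(tukey.summary())   # pairs whose intervals exclude zero differ significantly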

Annual Data

For annual data, there are no significant differences on the MASE, MAPE or RPS measures. The Box-Cox truncated distribution is the best for all three measures, but it is not statistically different from any other method of truncation. However, on the MLPL measure there are statistical differences, and here the truncated HSS proportional c approach is significantly better than every other method except the truncated HSS estimation method. The non-truncated normal distribution is significantly worse than four other methods.

Quarterly Data

For the MASE and MAPE measures, the truncated t distribution is significantly better than the truncated Box-Cox method, and there are no other significant differences for these measures. There are no significant differences on the RPS measure, but the MLPL measure shows that the truncated HSS proportional c method is significantly better than the normal distribution, and not significantly worse than any other method.

Monthly Data

For the MASE, the truncated JETS is significantly better than the truncated t, with no other significant differences. For the MAPE, both the truncated Box-Cox and the truncated JETS are significantly better than the truncated t.

For the RPS, four methods (truncated HSS proportional c, truncated HSS estimation, truncated Box-Cox and truncated JETS) are all significantly better than the normal distribution. The MLPL measure is similar to the RPS measure, with truncated HSS estimation replaced by truncated t in the set of methods that are significantly better than the normal distribution.

7.3 Conclusions

The truncated HSS proportional c method is never significantly worse than any other method, and it is often significantly better. It therefore seems that it should be used in preference to the standard normal distribution when the data consist only of positive real numbers, and we recommend doing so. These conclusions are similar to those of Chapter 4; truncation has not greatly affected the relative performance of the HSS proportional c approach.

8. AN INVESTIGATION OF OTHER TIME SERIES PROPERTIES

8.1 Introduction

Rather than focusing on which transform or truncation is best, this chapter examines how various properties of a time series affect prediction performance. Studies of this kind have been made by Reid (1972), Shah (1997), Meade (2000) and Petropoulos et al. (2014).

Reid (1972) applied four models to 113 time series and concluded that the Box-Jenkins approach was best; however, the seasonal or non-seasonal character of the selected time series affected which models performed best. Shah (1997) used the 203 quarterly series from the M-competition, a precursor to the M3-competition.

He used a training-sample method to ascertain which time series features were best suited to forecasting with particular models, and then used the features of the non-training-sample time series to select the best model; this individual selection method was found to be better than aggregate selection.

Meade (2000) used a simulation study, in which the simulated series had known properties, to find the best forecast methods by analysing these properties. These methods were applied to the M-competition, and it was found that this approach produced good forecasts, but not necessarily the best. Graff et al. (2013) used a machine-learning classification approach to investigate the performance of forecasting models, based on a relatively large set of covariates, where the covariates were statistical properties of the series used in both the M- and M3-competitions. Petropoulos et al. (2014) used both fast-moving and intermittent data simulations, analysed by regression, to determine which forecasting methods are best suited to which types of data. Their conclusions were much as expected: more random series produce worse forecasts, and longer forecast horizons produce worse forecasts.

The strengths of the approach taken in this chapter, relative to this previous work, are that real data are used rather than simulated series. Relative to Reid (1972) and Shah (1997), who did use actual time series, the data set used here, namely the M3-competition data, has a substantially larger sample size: for example, about three times as many series as the M-competition used by Shah (1997). While Graff et al. (2013) used the M3-competition, they assessed different forecasters' methods, and not the exponential smoothing models that this thesis focuses on.

Directly observable properties of the time series include the type (category) of each M3 series and the length of each series' fitting sample. Derived properties include the number of outliers when the series is fitted by the model selected by the AIC, and the scaled standard deviation of that model. These four properties are used as explanatory variables; we selected them because they can easily be calculated for every series given knowledge of the model selected by the AIC.

For this chapter, the response variable is the log of the MAPE of the model selected by the AIC on the non-transformed series (log MAPE ID). We chose this because the MAPE is a scale-independent measure, and using no transforms means that our analysis can be more easily replicated by others. We still used the rolling origin method.

An explanation of the explanatory variables follows:

Category: The M3 time series can be classified into six categories: micro, macro, finance, industry, demographic and other. This classification is distinct from the data frequency, so any time series can be classified into these six categories, though there are no "other" series for quarterly data. In the general linear model used to analyse the role of the explanatory variables in prediction performance, category is a factor. The categorisation of the M3 data was intentionally vague because it was associated with a competition. The categories are largely self-explanatory, with the possible exception of micro and macro, which refer to micro-economic and macro-economic data.

Fitting sample: The number of observations used to fit each series; this is a covariate.

Scaled sigma: Because the analysis needs to be scale-independent, $\hat\sigma$ must be scaled. We used the estimated standard deviation of the model selected by the AIC divided by the overall mean of the fitting sample of the time series, multiplied by 100, so that $\text{scsig} = 100\,\hat\sigma_{\text{mod}}/\hat\mu_{\text{sam}}$. The scaled sigma is a covariate (a short computational sketch of the derived variables follows these descriptions).

Number of outliers: The number of points in the fitting sample that are more than $3\hat\sigma$ away from what the model selected by the AIC would predict at that point. Although it is possible to treat this as a covariate, we treated it as a factor because there are at most five different values of this variable. For annual data, there were three series with two outliers; these series were combined with the one-outlier series. For monthly data, there were three series with five outliers, which were similarly combined with the four-outlier series.

Table 8.1 shows the distribution of outliers for each frequency, both originally and after combining. Table 8.2, which is sourced from Table 1 in Makridakis and Hibon (2000), shows the number of series in each category for each frequency. Table 2.2 gave statistics on the fitting sample lengths of the M3 time series.
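The two derived variables can be computed directly once a model has been fitted. The sketch below, using made-up arrays of observations and fitted values, shows one way to do so; the function and argument names are assumptions introduced here for illustration.

import numpy as np

def derived_covariates(y_fit, fitted, sigma_hat):
    """Scaled sigma and outlier count for one series (fitting sample only).

    y_fit     : observed fitting-sample values
    fitted    : one-step-ahead fitted values from the AIC-selected model
    sigma_hat : estimated standard deviation of that model
    """
    scsig = 100.0 * sigma_hat / np.mean(y_fit)                       # scaled sigma
    n_outliers = int(np.sum(np.abs(y_fit - fitted) > 3.0 * sigma_hat))
    return scsig, n_outliers

# Illustrative data only
y = np.array([102.0, 98.0, 110.0, 95.0, 160.0, 101.0])
f = np.array([100.0, 100.0, 104.0, 103.0, 100.0, 105.0])
print(derived_covariates(y, f, sigma_hat=6.0))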

Tab. 8.1: Outlier Distribution for Each Frequency. Original distributions and distributions after combining are given. There was no combining for quarterly data.

Tab. 8.2: Data Categories for Each Frequency (from Table 1 of Makridakis and Hibon, 2000)

  Freq        Micro   Industry   Macro   Finance   Demographic   Other
  Annual        146        102      83        58           245      11
  Quarterly     204         83     336        76            57       0
  Monthly       474        334     312       145           111      52
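The general linear model described above, with category and number of outliers as factors and fitting-sample length and scaled sigma as covariates, could be specified in Python roughly as follows. This is a sketch on synthetic data with assumed column names (log_mape_id, category, n_outliers, fit_length, scsig); it is not the software used in the thesis.

import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Synthetic stand-in: in the real analysis each row would hold log MAPE ID and
# the four explanatory variables for one M3 series.
rng = np.random.default_rng(2)
n = 300
df = pd.DataFrame({
    "category": rng.choice(["micro", "macro", "finance", "industry", "demographic", "other"], n),
    "n_outliers": rng.integers(0, 3, n),
    "fit_length": rng.integers(14, 126, n),
    "scsig": rng.uniform(1.0, 30.0, n),
})
df["log_mape_id"] = 2.0 + 0.03 * df["scsig"] + rng.normal(scale=0.5, size=n)

model = smf.ols("log_mape_id ~ C(category) + C(n_outliers) + fit_length + scsig", data=df).fit()
print(model.summary())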

8.2 Why MAPE Instead of MASE was Used

We would intuitively expect scaled sigma to have a positive correlation with the MASE or MAPE, since a series that can be fitted easily will usually be more predictable. The M-competitions have found that complex models that perform well in fitting do not necessarily perform well in forecasting, but that is a different argument from the proposition that series that are easily fitted should be more predictable.

It was therefore surprising that plotting log MASE against log scaled sigma for monthly data, as in Figure 8.1, showed a negative correlation that was highly significant. Yet when we plotted log MAPE against log scaled sigma, as in Figure 8.2, the expected positive correlation was obtained. These graphs establish that we should use log MAPE instead of log MASE as our response variable. As there are outlier points in the graphs in this section that could have a strong influence on the Pearson correlation coefficient, we used Spearman's ρ instead.
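The robustness of the rank-based coefficient to a single extreme point is easy to demonstrate; the following sketch, on simulated data rather than the M3 series, compares the two coefficients when one outlier is injected.

import numpy as np
from scipy.stats import pearsonr, spearmanr

rng = np.random.default_rng(0)
x = rng.normal(size=200)
y = x + rng.normal(scale=0.5, size=200)
x[0], y[0] = 10.0, -10.0                           # a single extreme outlier

print("Pearson r:    %.3f" % pearsonr(x, y)[0])    # pulled down by the outlier
print("Spearman rho: %.3f" % spearmanr(x, y)[0])   # rank-based, barely affected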

Fig. 8.1: Log MASE ID vs log scaled sigma for Monthly Data: Spearman's ρ̂ = −0.545

Fig. 8.2: Log MAPE ID vs log scaled sigma for Monthly Data: Spearman's ρ̂ = 0.877

Both the MAPE and the MASE use the absolute prediction errors divided by a scaling factor. For the MAPE, the scaling factor is simply the observed value, but for the MASE the scaling factor, known as the Mean Absolute Error (MAE), from (2.10.1) is:

$$\text{MAE} = \frac{1}{n-1}\sum_{t=2}^{n}\left|y_t - y_{t-1}\right|. \qquad (8.2.1)$$

The problem with using the MAE as the denominator is that it is highly correlated with scaled sigma, as shown in Figure 8.3. This is expected, because both the MAE and the scaled sigma are measures of the spread of the fitting sample.

With $h$ the number of withheld observations, let $\text{MFE} = \frac{1}{h}\sum_{j=1}^{h}\left|y_{n+j} - \hat y_{n+j}\right|$ be the mean of the future absolute errors, so that MASE = MFE/MAE. As we are taking logs of both the MASE and scaled sigma, we are effectively plotting (log MFE − log MAE) against log scaled sigma. As explained above, log MFE and log MAE should both be expected to correlate positively with log scaled sigma, but the subtraction of log MAE cancels some of this correlation. For annual and quarterly data, we are left with a very small correlation between log MASE and log scaled sigma; for monthly data, we obtain the negative correlation shown in Figure 8.1.

Fig. 8.3: Log MAE vs log scaled sigma for Monthly Data: Spearman's ρ̂ = 0.918

For monthly data, the log MASE and log MAPE actually have a small negative correlation (shown in Figure 8.4), while quarterly and annual data exhibit a clear positive correlation between log MASE and log MAPE. Since the numerators of both the MASE and the MAPE contain the same absolute prediction errors, this implies that the denominators are negatively correlated for monthly data, but not for annual and quarterly data. A possible explanation is that more volatile monthly series, which have larger MAEs, are more likely to have relatively small observations in the withheld data; this results in these series having relatively large MAPEs but small MASEs.

Recall the definition of the MAPE:

$$\text{MAPE} = \frac{100}{h}\sum_{j=1}^{h}\left|\frac{y_{n+j} - \hat y_{n+j}}{y_{n+j}}\right|. \qquad (8.2.2)$$

Unlike the MASE, the MAPE uses element-by-element division of one vector by another vector; the MASE divides the vector of absolute prediction errors by the scalar MAE. As a result, the mean of the withheld sample may have less influence on the MAPE than the minimum of the withheld sample, since a large error divided by a relatively small observation results in a very large APE at that point.
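The contrast between the scalar denominator of the MASE and the element-by-element denominator of the MAPE can be made concrete in a few lines. The sketch below follows (8.2.1) and (8.2.2) with made-up numbers; the single very small withheld observation inflates the MAPE but not the MASE.

import numpy as np

def mase(y_fit, y_future, forecasts):
    """MASE: mean absolute forecast error scaled by the in-sample naive MAE (8.2.1)."""
    mae = np.mean(np.abs(np.diff(y_fit)))                 # scalar denominator
    return np.mean(np.abs(y_future - forecasts)) / mae

def mape(y_future, forecasts):
    """MAPE: each absolute error divided by its own observation (8.2.2)."""
    return 100.0 * np.mean(np.abs((y_future - forecasts) / y_future))

# Illustrative data only
y_fit = np.array([120.0, 125.0, 118.0, 130.0, 127.0])
y_future = np.array([128.0, 5.0, 131.0])                  # one very small withheld observation
forecasts = np.array([126.0, 126.0, 126.0])
print(mase(y_fit, y_future, forecasts), mape(y_future, forecasts))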

Fig. 8.4: Log MASE vs log MAPE for Monthly Data: Spearman's ρ̂ = −0.202

Fig. 8.5: Log MAE vs log Mean Future Observations for Monthly Data

Figures 8.5 and 8.6 respectively show plots of log MAE against the log mean future (withheld) observations and log MAE against the log minimum future observation for monthly data. Both graphs show negative correlations that are larger in magnitude than for either annual or quarterly data. This explains why the log MASE and log MAPE have a small negative correlation for monthly data.

Fig. 8.6: Log MAE vs log Minimum Future Observations for Monthly Data: Spearman's ρ̂ = −0.573
