NATCOR. Forecast Evaluation. Forecasting with ARIMA models. Nikolaos Kourentzes

NATCOR Forecast Evaluation Forecasting with ARIMA models Nikolaos Kourentzes n.kourentzes@lancaster.ac.uk

Outline
1. Bias measures
2. Accuracy measures
3. Evaluation schemes
4. Prediction intervals
5. Parameter selection
6. Method selection

Forecast Evaluation
Forecast errors lead to inaccurate results, which lead to a loss (in performance, financial terms, etc.). Measuring the loss directly is important but often hard, so the forecast error can be used as a proxy. It is therefore important to track and evaluate forecast errors.
Forecast evaluation is a key activity in the forecasting process. It is at the core of:
- Forecast monitoring
- Method selection and parameterisation

Forecast Error - Definition
The forecast error at period t is the difference between the actual and the forecast: e_t = A_t - F_t.
[Figure: weekly sales (A_t) and forecasts (F_t) for SKU A, with the resulting error series plotted below. Positive errors correspond to under-forecasting, negative errors to over-forecasting.]

Measures of Bias
Instead of considering the complete vector of errors we can aggregate them:
- Mean Error: ME = (1/n) Σ_{t=1}^{n} e_t, the most common measure of forecast bias.
- Median Error: MdE = median(e_t).
Measures of bias show whether we typically over- or under-forecast. Ideally they should be as close to zero as possible.
[Figure: three forecasting methods on the same series. Method A, ME = 12.67: positive bias, under-forecasting. Method B, ME = -23.12: negative bias, over-forecasting. Method C, ME = -0.76: approximately unbiased.]
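Below is a minimal sketch (not from the slides) of how ME and MdE might be computed with NumPy; the data values are made up purely for illustration.

```python
import numpy as np

def mean_error(actuals, forecasts):
    """Mean Error (ME): average of e_t = A_t - F_t. Near zero = little bias."""
    errors = np.asarray(actuals, dtype=float) - np.asarray(forecasts, dtype=float)
    return errors.mean()

def median_error(actuals, forecasts):
    """Median Error (MdE): the median of e_t, less sensitive to outlying errors."""
    errors = np.asarray(actuals, dtype=float) - np.asarray(forecasts, dtype=float)
    return np.median(errors)

# Made-up actuals and forecasts
A = [120, 95, 130, 110, 105]
F = [110, 100, 120, 115, 100]
print(mean_error(A, F), median_error(A, F))   # a positive ME indicates under-forecasting
```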

Measures of Bias
[Figure: three panels showing sales, the error series and the mean error line for three forecasts over the same weeks.]
- Mean error = 149.9: we typically forecast less than we should (under-forecasting). This forecast will lead to biased decisions.
- Mean error = 0.1: the forecast shows no systematic preference, so it is useful for objective decision making.
- Mean error = -15.1: we typically forecast more than we should (over-forecasting). This forecast will also lead to biased decisions.

A Note on Mean and Median Errors
It is well known that the mean is affected by outliers and asymmetric distributions more than the median. In the context of forecasting:
- Median: insensitive to extremes (outliers); summarises typical performance better.
- Mean: sensitive to extremes (outliers); useful when we are interested in them.
A substantial difference between the mean and median errors suggests that the error distribution contains outliers or is asymmetric.
[Figure: histogram of the error distribution with the mean and median error marked; the outliers affect the mean strongly.]

Bias and Magnitude of Errors - Accuracy
The Mean Error does not tell us whether we are accurate, merely whether we are biased: positive and negative errors cancel out. To overcome this we can calculate squared errors (e_t^2) or absolute errors (|e_t|), which do not cancel out once aggregated.

        e_t     e_t^2   |e_t|
e_1     -7      49      7
e_2     +12     144     12
e_3     -5      25      5
Sum     0       218     24
Mean    0       72.67   8

A mean error of 0 would suggest "no error", even though the individual forecasts are clearly wrong.

Measures of Accuracy - Scale Dependent
Some common error measures that can be defined using these operators are:
1. Mean Squared Error: MSE = (1/n) Σ_{i=1}^{n} (A_i - F_i)^2
   Sensitive to outliers (squaring); non-intuitive (the units are squared); scale dependent.
2. Root Mean Squared Error: RMSE = sqrt(MSE)
   As MSE, but the resulting units are not squared; scale dependent.
3. Mean Absolute Error: MAE = (1/n) Σ_{i=1}^{n} |A_i - F_i|
   Robust to outliers; scale dependent.
Scale dependent errors can only be used to compare different methods on the same time series: if a time series is in cars, the errors are also in cars, and similar problems occur due to the scale. They should not be used for comparisons across different time series. Median versions of the above errors can of course also be defined.
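A minimal sketch (not from the slides) of the three scale-dependent measures in NumPy; the data values are invented for illustration.

```python
import numpy as np

def mse(actuals, forecasts):
    """Mean Squared Error: sensitive to outliers, units are squared."""
    e = np.asarray(actuals, dtype=float) - np.asarray(forecasts, dtype=float)
    return np.mean(e ** 2)

def rmse(actuals, forecasts):
    """Root Mean Squared Error: back on the original scale of the data."""
    return np.sqrt(mse(actuals, forecasts))

def mae(actuals, forecasts):
    """Mean Absolute Error: more robust to outliers, still scale dependent."""
    e = np.asarray(actuals, dtype=float) - np.asarray(forecasts, dtype=float)
    return np.mean(np.abs(e))

# Made-up data: all three measures carry the units (and scale) of the series
A = [120, 95, 130, 110, 105]
F = [110, 100, 120, 115, 100]
print(mse(A, F), rmse(A, F), mae(A, F))
```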

A Note on Absolute and Squared Errors
[Figure: histograms of the errors (e_t), the absolute errors (|e_t|) and the squared errors (e_t^2), each with its mean and median marked. Notice how extreme the outliers become after squaring: the mean of the squared errors sits far above their median.]
Squared errors are sensitive to outliers, as outliers are inflated disproportionately compared to smaller errors. Absolute errors, on the other hand, do not rescale the errors, so the contribution of outliers is not exaggerated.

Percentage Errors - Scale Independent
To compare across different time series we define a family of scale independent measures, for which neither the level nor the units of the original time series matter.
Percentage error: PE_t = (A_t - F_t) / A_t
- Expresses the error as a ratio to the actual level; free of units.
- Requires a meaningful zero (0°C is not a meaningful zero), so that the actuals cannot become negative.
Based on the percentage errors we can define percentage bias and accuracy metrics. These are scale and unit independent and therefore allow comparisons and aggregations across time series.

Percentage Errors - Scale Independent
Some common measures defined on percentage errors are:
1. Mean Absolute Percentage Error: MAPE = (1/n) Σ_{i=1}^{n} |A_i - F_i| / A_i
   Scale independent and very intuitive (the method is x% wrong).
   Biased: positive and negative errors do not count equally.
   Requires a non-zero, positive denominator.
2. Symmetric Mean Absolute Percentage Error (sMAPE): if you see it, avoid it! It has too many issues.
There are also median versions of the absolute percentage errors.
MAPE bias example: with Actual = 100 and Forecast = 90 the absolute percentage error is |10|/100 = 10%, but with Actual = 90 and Forecast = 100 it is |-10|/90 = 11.111%, even though the absolute error is the same.

Percentage Error Example

Period  Actual  Forecast  e_t = A_t - F_t  PE        APE = |e_t| / A_t
t+1     1060    1010      50               4.72%     4.72%
t+2     1000    1010      -10              -1.00%    1.00%
t+3     1020    1010      10               0.98%     0.98%
MAPE = 2.23%

Period  Actual  Forecast  e_t = A_t - F_t  PE        APE = |e_t| / A_t
t+1     1060    1010      50               4.72%     4.72%
t+2     0       1010      -1010            Infinite  Infinite
t+3     1020    1010      10               0.98%     0.98%
MAPE = Infinite!

If there is a zero (or a value close to zero) in the actuals, the MAPE becomes infinite (or extremely large). Median versions (MdAPE) can typically still be calculated, as the infinite errors do not dominate the median.
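A small sketch (not from the slides) of MAPE and its median counterpart in NumPy, using made-up values in the spirit of the example above; note how a single zero actual makes the MAPE infinite while the median version remains usable.

```python
import numpy as np

def mape(actuals, forecasts):
    """Mean Absolute Percentage Error, in %: mean of |A - F| / A."""
    A = np.asarray(actuals, dtype=float)
    F = np.asarray(forecasts, dtype=float)
    return np.mean(np.abs(A - F) / A) * 100

def mdape(actuals, forecasts):
    """Median Absolute Percentage Error, in %: not dominated by huge or infinite APEs."""
    A = np.asarray(actuals, dtype=float)
    F = np.asarray(forecasts, dtype=float)
    return np.median(np.abs(A - F) / A) * 100

F = [1010, 1010, 1010]
print(mape([1060, 1000, 1020], F))   # finite MAPE
print(mape([1060, 0, 1020], F))      # a zero actual -> division by zero -> inf
print(mdape([1060, 0, 1020], F))     # the median ignores the infinite APE
```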

Example of Scale Independent Errors
[Figure: two series with their forecasts. SKU A (computers): MAE = 134.0 computers, MAPE = 179.7%. SKU C (iron nails, sold in the millions): MAE of roughly 6.1 million iron nails, MAPE = 26.6%.]
The MAE of the first series is dwarfed by that of the second, because the scale of the second series is in millions. MAE also carries the units of each series. The MAPE, being scale independent, allows the two series to be compared.

Relative Errors
To compare against a benchmark method we define a family of relative error measures.
Relative error: RE_t = (A_t - F_t) / (A_t - F_t^Benchmark)
- Expresses the error as a ratio to the error of another forecasting method, typically the naive.
- Free of units (scale independent) and directly compares forecasting methods.
Geometric Mean Relative Absolute Error: GMRAE = ( Π_{i=1}^{n} |A_i - F_i| / |A_i - F_i^Naive| )^(1/n)
- The absolute form of the relative errors.
- GMRAE < 1: the method is better than the benchmark; GMRAE > 1: worse; GMRAE = 1: as good as the benchmark.
To summarise across time series we again use a geometric mean of the per-series GMRAEs.

Relative Summary Errors
Another way to calculate the GMRAE is via the mean of the logarithms of the relative absolute errors:
GMRAE = ( Π_{i=1}^{n} |A_i - F_i| / |A_i - F_i^Naive| )^(1/n) = exp( (1/n) Σ_{i=1}^{n} log( |A_i - F_i| / |A_i - F_i^Naive| ) )
An alternative error metric is the AvRelMAE (Average Relative MAE):
AvRelMAE = ( Π_{i=1}^{n} MAE_i^Forecast / MAE_i^Benchmark )^(1/n) = exp( (1/n) Σ_{i=1}^{n} log( MAE_i^Forecast / MAE_i^Benchmark ) )
The idea is the following:
1. Calculate the MAE for each series (i = 1, ..., n).
2. Calculate the ratio of each MAE to the benchmark MAE.
3. Average across series with a geometric mean (as we are averaging ratios).
The AvRelMAE is more robust to calculate than the GMRAE, but less sensitive to individual errors. It has the same interpretation as the GMRAE.
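A minimal sketch (not from the slides) of both relative measures, computed through the mean-of-logs form shown above; the naive benchmark forecasts and all values are made up.

```python
import numpy as np

def gmrae(actuals, forecasts, benchmark_forecasts):
    """Geometric Mean Relative Absolute Error against a benchmark (e.g. the naive)."""
    A, F, B = (np.asarray(x, dtype=float) for x in (actuals, forecasts, benchmark_forecasts))
    rae = np.abs(A - F) / np.abs(A - B)
    return np.exp(np.mean(np.log(rae)))        # geometric mean via mean of logs

def av_rel_mae(mae_forecast, mae_benchmark):
    """AvRelMAE: geometric mean of per-series MAE ratios against a benchmark."""
    ratios = np.asarray(mae_forecast, dtype=float) / np.asarray(mae_benchmark, dtype=float)
    return np.exp(np.mean(np.log(ratios)))

A = [120.0, 95, 130, 110, 105]                 # made-up actuals
F = [110.0, 100, 120, 115, 100]                # method under evaluation
B = [115.0, 120, 95, 130, 110]                 # naive benchmark forecasts (assumed)
print(gmrae(A, F, B))                          # < 1 means better than the benchmark
print(av_rel_mae([134.0, 90.0], [150.0, 120.0]))  # per-series MAEs, made up
```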

Error Measure Remarks
- Errors can be calculated across any forecast horizon, or aggregation of forecast horizons.
- Scale independent errors can also be aggregated across time series.
- There is no single best error measure; the choice depends on the data and the question at hand.
- Error measures help assess the bias, accuracy and robustness of a forecasting method, as well as its ranking against other methods.
- Different error measures may produce different model rankings.

Forecast Monitoring
Suppose we run two different forecasts and track their performance at every period. The more reactive of the two SES forecasts is better for this series. But if this SKU were forecast automatically every month, without any human intervention, could we identify the problem?
[Figure: SKU B with two SES forecasts using different smoothing parameters, and the out-of-sample t+1 to t+6 MAE tracked over time for each. Very high errors should trigger an alert for manual intervention; here the alert occurs very quickly, because the poor forecast produces very high errors for prolonged periods.]

Forecast Monitoring
We can monitor automated forecasts by tracking their errors, either in an unstructured way or with a control chart approach. The errors can be used raw or smoothed.
[Figure: control chart of forecast errors over time, with control limits at ±1.96 MAE; periods with unexpectedly high errors fall outside the limits.]
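A minimal control-chart sketch (not from the slides): flag any period whose error falls outside ±1.96 times the MAE. The error track is simulated, with two large errors injected so that something gets flagged.

```python
import numpy as np

def monitor_errors(errors, k=1.96):
    """Return the MAE and the indices of periods breaching the +/- k * MAE limits."""
    e = np.asarray(errors, dtype=float)
    mae = np.mean(np.abs(e))
    alerts = np.where(np.abs(e) > k * mae)[0]
    return mae, alerts

rng = np.random.default_rng(1)
errors = rng.normal(0, 50, 45)       # made-up, mostly well-behaved errors
errors[[20, 33]] += 400              # inject two unexpectedly large errors
mae, alerts = monitor_errors(errors)
print(f"MAE = {mae:.1f}, alerts at periods {alerts}")
```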


In- and Out-of-Sample
The historical observations can be split into two subsets:
- In-sample: used for model building and parameterisation.
- Out-of-sample: used for model evaluation. It is not used in building the model and is never seen by it.
The out-of-sample subset lets us simulate true forecasts, instead of waiting for new, unobserved values, in order to evaluate the forecasting performance of alternative models.
[Figure: a monthly series split at the forecast origin into the in-sample part (used to build the model) and the out-of-sample part (used to evaluate it); note that the forecast is multiple steps ahead.]

Static Origin Evaluation
The simplest evaluation produces a single forecast in the out-of-sample subset. Let the forecast horizon be 12 months and the holdout (out-of-sample) 24 months:
[Figure: the series with a single forecast origin and the 12-month forecast into the holdout.]
We have a forecast for t+1, t+2, ..., t+12 from a single forecast origin. If we are interested in forecasting t+12 accurately, we have only one measurement of that error, so we have low confidence in our accuracy estimate. This scheme is called static origin evaluation.
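A small sketch (not from the slides) of a static origin evaluation using a naive forecast as the method; the series is made up. Note that every lead time gets exactly one error.

```python
import numpy as np

def naive_forecast(history, horizon):
    """Naive forecast: repeat the last observed value over the horizon."""
    return np.repeat(history[-1], horizon)

def static_origin_abs_errors(series, horizon):
    """Split at a single origin, forecast once, return one absolute error per lead time."""
    y = np.asarray(series, dtype=float)
    insample, holdout = y[:-horizon], y[-horizon:]
    return np.abs(holdout - naive_forecast(insample, horizon))

y = [50.0, 52, 55, 53, 58, 60, 57, 62, 65, 63, 68, 70]   # made-up series
print(static_origin_abs_errors(y, horizon=3))             # errors for t+1, t+2, t+3
```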

Static Origin Evaluation
[Diagram: the in-sample is used to fit the model; the out-of-sample (holdout) after the origin is used to evaluate the forecasts.]
Limitations:
1. One forecast per lead time: a long track history is needed.
2. The evaluation is susceptible to corruption: strange origins or targets may affect the apparent quality of the forecasts.
3. Averaging over different lead times corrupts the summary error statistic.

Rolling Origin Evaluation
A way to overcome these limitations is the rolling origin evaluation scheme.
[Diagram: starting from an initial in-sample/out-of-sample split, the forecast origin is rolled forward one period at a time; at each step the in-sample grows and a new set of forecasts is produced. We keep rolling the origin until there is not enough out-of-sample left to evaluate against.]

Rolling Origin Evaluation

Number of forecasts available per lead time, for holdouts of 5 and 10 periods:

                 Holdout: 5                    Holdout: 10
Forecast lead    Fixed origin  Rolling origin  Fixed origin  Rolling origin
t+1              1             5               1             10
t+2              1             4               1             9
t+3              1             3               1             8
t+4              1             2               1             7
t+5              1             1               1             6

Rolling origin evaluation:
1. Provides more forecasts per lead time.
2. Overcomes the limitations of fixed origin evaluation:
   - more forecasting history per lead time for an equal holdout sample;
   - no need to average over lead times;
   - strange origins or targets can be overcome.
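A minimal rolling origin sketch (not from the slides), again with a naive forecast as the method and an invented series; each lead time now receives several error measurements instead of one.

```python
import numpy as np

def naive_forecast(history, horizon):
    """Naive benchmark: repeat the last observed value."""
    return np.repeat(history[-1], horizon)

def rolling_origin_abs_errors(series, horizon, origins):
    """Roll the forecast origin forward, collecting an (origins x horizon) error matrix."""
    y = np.asarray(series, dtype=float)
    first_origin = len(y) - horizon - origins + 1
    errors = np.empty((origins, horizon))
    for i in range(origins):
        o = first_origin + i                          # current forecast origin
        fc = naive_forecast(y[:o], horizon)           # forecast from this origin
        errors[i] = np.abs(y[o:o + horizon] - fc)
    return errors

y = np.arange(60) + np.random.default_rng(0).normal(0, 3, 60)   # made-up series
errors = rolling_origin_abs_errors(y, horizon=6, origins=10)
print(errors.mean(axis=0))    # MAE per lead time, t+1 ... t+6
```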

Rolling Origin Evaluation
Using the previous example, a rolling origin evaluation would look like:
[Figure: two panels, SES with alpha = 0.2 and alpha = 0.5, each showing the in-sample, the out-of-sample and the rolling origin forecasts; the black dots mark the forecast origins.]
Visualising the rolling origin forecasts makes it easier to appreciate the importance of smooth forecasts that filter out the noise.

Evaluation Schemes - Sample Size
Having enough sample for model building and evaluation is crucial.
- Lack of sample severely restricts the selection of alternative models, as many require abundant data. On really short time series we can only apply naive and simple average models.
- Models that require parameterisation perform better when there is ample data: the more data available, the better the estimation of the parameters. Consider the difficulty of setting the gamma (seasonal) parameter of seasonal exponential smoothing models.
- The same is true for evaluating forecasts: with a large sample many errors can be calculated, and therefore we have higher confidence in the estimated figure.
- Sample size also affects our understanding of the time series components. How many observations are required to identify a seasonal time series?

Evaluation Schemes - Sample Size
Let us try to find the t+1 forecast error of a method using absolute errors. Over 84 errors the MAE is 45.75, but assume this is unknown and we estimate it as the errors arrive:
- The 1st error is 65.96; running mean 65.96. Are we confident?
- The 2nd error is 68.18; running mean 67.07. Are we confident?
- The 3rd error is 99.35; running mean 77.83. Are we confident?
- ...
- After the 20th error the running mean is 57.3. Are we confident?
- After the 30th error the running mean is 49.65. Are we confident?
[Figure: the absolute errors and the cumulative MAE over the 84 observations, gradually converging to the final MAE of 45.75.]
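A tiny sketch (not from the slides) of how the running MAE estimate settles down as more errors accumulate; the errors are simulated, not those of the example above.

```python
import numpy as np

rng = np.random.default_rng(2)
abs_errors = np.abs(rng.normal(0, 60, 84))                  # made-up absolute errors
running_mae = np.cumsum(abs_errors) / np.arange(1, 85)      # MAE estimate after each error
for n in (1, 2, 3, 20, 30, 84):
    print(f"after {n:2d} errors the estimated MAE is {running_mae[n - 1]:6.2f}")
```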

Evaluation Scheme Remarks
The forecasting accuracy of one model is meaningful only relative to another model or benchmark; an error of 5% or 50% is not informative without comparing it to benchmarks. Naive methods make simple and effective benchmarks.
There are no set rules determining the size of the out-of-sample (or holdout); however, it should:
- be at least as long as the forecast horizon;
- leave enough in-sample data for model building;
- provide enough forecasts of the forecast horizon of interest.
A simple heuristic is 80% in-sample and 20% out-of-sample, but this is often inappropriate.


Prediction Intervals
We can use the error distribution of valid models to formulate prediction intervals for the forecasts.
[Figure: a series with its SES forecast, the in-sample errors e_t = A_t - F_t, and the error probability density with the 68.2%, 95.5% and 99.7% regions marked.]

Prediction Intervals
Starting from the sample standard deviation of the h-step-ahead errors:
s_e(h) = sqrt( (1/n) Σ_{i=1}^{n} (e_{t+h,i} - ē_{t+h})^2 )
For an unbiased model the mean of the errors is zero, so:
s_e(h) = sqrt( (1/n) Σ_{i=1}^{n} e_{t+h,i}^2 ) = sqrt( MSE_h )
The forecast is the expected value. Adding and subtracting the quantity z_{α/2} · s_e(h) to and from the forecast gives the prediction interval:
PI_h = F_{t+h} ± z_{α/2} · s_e(h)
For normally distributed errors (valid models) the standard score is easy to obtain:

Prediction interval   z_{α/2}
50%                   0.67
90%                   1.64
95%                   1.96
99%                   2.58

Using prediction intervals we can visualise the confidence in our forecasts.

Prediction Intervals
[Figure: 80% and 90% prediction intervals for four example series: SKU A, SKU C, UK Android market share and US air passengers.]
Observe that the prediction intervals vary with the series, the method and the horizon. We have more confidence in forecasts with tight prediction intervals.

Prediction Intervals
Calculating the prediction interval formula can be complicated, as it requires the in-sample MSE at multiple forecast horizons h:
PI_h = F_{t+h} ± z_{α/2} · s_e(h) = F_{t+h} ± z_{α/2} · sqrt( MSE_h )
MSE_h can be obtained from a rolling origin in-sample evaluation at the relevant forecast horizon. Alternatively, it can be approximated by:
s_e(h) ≈ sqrt(h) · sqrt( MSE_{t+1} )
that is, the square root of the horizon multiplied by the square root of the one-step-ahead in-sample MSE. Note that there is substantial empirical evidence that this is a very rough approximation.
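A minimal sketch (not from the slides) of prediction intervals under the rough sqrt(h) approximation above, using the z-scores from the table; the forecasts and one-step in-sample errors are made up.

```python
import numpy as np

Z = {0.50: 0.67, 0.90: 1.64, 0.95: 1.96, 0.99: 2.58}    # z-scores as in the slide

def prediction_intervals(forecasts, insample_1step_errors, level=0.95):
    """PI_h = F(t+h) +/- z * sqrt(h) * sqrt(MSE_1), the rough approximation above."""
    f = np.asarray(forecasts, dtype=float)
    mse1 = np.mean(np.asarray(insample_1step_errors, dtype=float) ** 2)
    h = np.arange(1, len(f) + 1)
    half_width = Z[level] * np.sqrt(h) * np.sqrt(mse1)
    return f - half_width, f + half_width

lower, upper = prediction_intervals([500, 500, 500, 500], [30, -20, 45, -10, 25])
print(np.round(lower, 1))   # intervals widen with the horizon
print(np.round(upper, 1))
```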


Parameter Selection
We have seen for the exponential smoothing methods that we can select the smoothing parameters based on the theoretical properties of the methods and the characteristics of the time series:
- Low parameter values imply long weighted averages, and therefore robustness against outliers and increased noise.
- Higher parameter values imply shorter weighted averages, reacting faster to new information and handling breaks in the series better.
However, it may be desirable to automate the parameter selection process. We can use in-sample error metrics for this purpose.

Parameter Selection
[Figure: SKU A with SES forecasts for alpha = 0.1 and alpha = 0.4.]
The alpha = 0.1 forecast is better here, resulting in a better fit (lower error): its in-sample MSE of 27,337 is below that of the alpha = 0.4 forecast. Based on this idea we can optimise model parameters, for any exponential smoothing method or more generally.

Parameter Selection: SES example
We can calculate the in-sample MSE for various values of alpha and identify the value that gives the lowest error.
[Figure: in-sample MSE as a function of alpha, with the minimum at alpha = 0.16.]
This result is very close to the value we chose manually (0.1). For more complex methods we can of course optimise several parameters simultaneously.
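A minimal sketch (not from the slides) of this kind of grid search: a hand-rolled SES whose one-step in-sample MSE is evaluated over a grid of alpha values; the series is simulated.

```python
import numpy as np

def ses_insample_mse(y, alpha, initial_level=None):
    """One-step-ahead in-sample MSE of simple exponential smoothing."""
    y = np.asarray(y, dtype=float)
    level = y[0] if initial_level is None else initial_level
    sq_errors = []
    for obs in y:
        sq_errors.append((obs - level) ** 2)          # forecast for this period = current level
        level = alpha * obs + (1 - alpha) * level     # update the level after observing
    return np.mean(sq_errors)

y = 500 + np.random.default_rng(3).normal(0, 40, 52)  # made-up weekly series
alphas = np.arange(0.01, 1.0, 0.01)
mses = [ses_insample_mse(y, a) for a in alphas]
best_alpha = alphas[int(np.argmin(mses))]
print(f"alpha minimising the in-sample MSE: {best_alpha:.2f}")
```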

Parameter Selection: SES example
The same principle can be applied to choose the initial level as well; we then vary both the initialisation and the smoothing parameter.
[Figure: SKU B with the optimised SES forecast; the optimised initial level is 1697 and alpha is 0.3237.]
As the number of parameters (including initialisation values) increases, optimisation becomes more time consuming and requires more data.

Parameter Selection
Will the optimal parameters always be the best? In fact, no. There are many reasons for this:
- Optimisation is done in-sample, and the correlation between in-sample and out-of-sample error has been shown to be low.
- Optimisation is (typically) done using the t+1 MSE. In practice we forecast over longer horizons, and our pragmatic cost functions are different.
- MSE is by construction very reactive to extreme errors, which may distort the error surface that we search for the optimal values.
- Sample limitations bite as the number of parameters to optimise increases.
- Minimum error may not be the business objective: companies may prefer consistency of forecasts across origins instead.
Optimisation is very useful for automation; however, human experts should override the identified parameters if they violate theory or the objectives.

Parameter Selection - Try it out
Experiment with setting the alpha parameter: Do you agree with the optimal value? Do the in-sample and out-of-sample errors behave in the same way?
https://kourentzes.shinyapps.io/shinyses

Parameter Selection Remarks
- Optimisation of complex models is sensitive to the starting conditions (local optima). Different sets of initial values and parameters may give different results.
- Optimising on bias does not make sense, as positive and negative errors cancel out.
- Optimisation results may change depending on the error metric used. MSE is the most common, but other metrics may be useful.
[Figure: the log(MSE) surface over alpha and the initial level for SKU B, with the minima under MSE, MAE and MAPE marked, and the resulting forecasts. The optimal alpha is 0.3238 under MSE, 0.3423 under MAE and 0.2574 under MAPE; the optimised initial level also differs between metrics.]

Method Selection
We can use similar principles to select the appropriate forecasting method. There are two major approaches:
- Using information criteria (usable only within a single family of methods, e.g. exponential smoothing).
- Using a validation holdout sample and rolling origin evaluation.
These complement manual selection, which is based on understanding the characteristics of the time series.

Method Selection - Why Fit Errors Do Not Work
For selecting between methods we cannot use the in-sample errors as we did for parameter selection, because more complex models tend to have lower fit errors even when their forecasts perform worse.
[Figure: SKU B with forecasts from level exponential smoothing and trend-seasonal exponential smoothing, split into in-sample and holdout periods.]
In this example the trend-seasonal exponential smoothing achieves the lower in-sample MAE (528.55 at t+1), yet the level exponential smoothing has by far the lower out-of-sample MAE over t+1 to t+6 (321.28). More complex models are more flexible and have a higher potential to overfit compared to simpler models.

Method Selection - Information Criteria
One approach is to penalise the fit of more complex models for the number of parameters they use (their complexity). We define an information criterion as:
IC = ln(MSE) + p · Q(n)
where ln(MSE) is the logarithm of the one-step-ahead in-sample MSE, p is the number of parameters (including initial values), Q(n) is the penalty function and n is the in-sample size.
- Akaike Information Criterion (AIC): Q(n) = 2/n
- Bayesian Information Criterion (BIC): Q(n) = ln(n)/n
BIC penalises larger models more heavily. For exponential smoothing there are no significant differences in performance between the two.
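A minimal sketch (not from the slides) of this IC definition; the MSE values, parameter counts and sample size are invented to mimic a simple versus a more complex exponential smoothing model.

```python
import numpy as np

def information_criteria(mse_1step, n_params, n_obs):
    """IC = ln(MSE) + p * Q(n), with Q(n) = 2/n for AIC and ln(n)/n for BIC (slide definition)."""
    aic = np.log(mse_1step) + n_params * (2.0 / n_obs)
    bic = np.log(mse_1step) + n_params * (np.log(n_obs) / n_obs)
    return aic, bic

# Made-up numbers: the complex model fits better (lower MSE) but uses many more parameters
simple_model  = information_criteria(mse_1step=812_000.0, n_params=2,  n_obs=52)
complex_model = information_criteria(mse_1step=414_000.0, n_params=20, n_obs=52)
print("simple  AIC, BIC:", [round(v, 4) for v in simple_model])
print("complex AIC, BIC:", [round(v, 4) for v in complex_model])   # the penalty can reverse the ranking
```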

Information Criteria Example
[Figure: SKU B with the level and trend-seasonal exponential smoothing forecasts over the in-sample and holdout periods.]
The trend-seasonal exponential smoothing has the lower one-step in-sample MSE (413,584.9), but both the AIC and the BIC are lower for the level exponential smoothing (e.g. BIC 13.7923 against 14.5926). Both information criteria therefore select the level model, which indeed has the lower out-of-sample MAE over t+1 to t+6 (321.28): they give us the correct answer.

Method Selection - Holdout Sample
In the other approach we simply measure the error on a subset of the series that is not used for fitting the models: a validation set, evaluated with rolling origin forecasts, with a further out-of-sample set (also rolling origin) kept to check the outcome.
In the example above the level exponential smoothing has a validation MAE over t+1 to t+6 of 655.64 and an out-of-sample MAE of 321.28; the validation set ranks the two methods in the same way as the out-of-sample, so again we get the correct answer.

Method Selection
Information criteria
- Pros: easy to calculate.
- Cons: applicable only within a single family of methods; cannot always be aligned with the true cost function of the company.
Holdout set
- Pros: universal, can be used to select between any methods; can be fully aligned with the true cost function.
- Cons: sample is lost to the validation set; if the sample size is not adequate for a reasonable rolling origin evaluation the results may not be reliable; computationally demanding.

Method and Parameter Selection Remarks
- Forecast evaluation can help us automate both method and parameter selection.
- There are several alternative options; they often produce similar results.
- For reliable fully automatic performance, use forecast monitoring.
- The key benefit of statistics is automation, which is essential for modern business forecasting problems.
- Experienced human experts can outperform automatic methods: they understand the structure of the time series and choose the method and parameters appropriately.

Thank you for your attention! Questions?
Nikolaos Kourentzes
Lancaster University Management School, Lancaster Centre for Forecasting, Lancaster, LA1 4YX
email: n.kourentzes@lancaster.ac.uk
Forecasting blog: http://nikolaos.kourentzes.com
www.forecasting-centre.com/
Full or partial reproduction of the slides is not permitted without the author's consent. Please contact n.kourentzes@lancaster.ac.uk for more information.