Long Term Time Series Prediction with Multi-Input Multi-Output Local Learning
Gianluca Bontempi
Machine Learning Group, Département d'Informatique, Faculté des Sciences
ULB, Université Libre de Bruxelles, 1050 Bruxelles, Belgium
gbonte@ulb.ac.be

Abstract. Existing approaches to long term time series forecasting are based either on iterated one-step-ahead predictors or on direct predictors. In both cases the modeling techniques used to implement these predictors are multi-input single-output techniques. This paper discusses the limits of single-output approaches when the predictor is expected to return a long series of future values, and presents a multi-output approach to long term prediction. The motivation for this work is the fact that, when predicting multiple steps ahead of a time series, it can be worthwhile to exploit the information that one future series value carries about another future value. We propose a multi-output extension of our previous work on Lazy Learning, called LL-MIMO, and we introduce an averaging strategy over several long term predictors to improve the final accuracy. In order to show the effectiveness of the method, we present the results obtained on the three training time series of the ESTSP 08 competition.

1 Introduction

A regular time series is a sequence of measurements y_t of an observable taken at equal time intervals. Both a deterministic and a stochastic interpretation of the forecasting problem on the basis of a historical dataset exist. The deterministic interpretation is supported by the well-known Takens theorem [13], which implies that for a wide class of deterministic systems there exists a diffeomorphism (one-to-one differentiable mapping) between a finite window of the time series {y_{t-1}, y_{t-2}, ..., y_{t-m}} (lag vector) and the state of the dynamic system underlying the series.
This means that in theory there exists a multi-input single-output mapping (delay coordinate embedding) f: R^m -> R such that:

y_{t+1} = f(y_{t-d}, y_{t-d-1}, ..., y_{t-d-m+1})    (1)

where m (the embedding dimension) is the number of past values taken into consideration and d is the lag time. This formulation returns a state space description, where in the m-dimensional space the time series evolution is a trajectory and each point represents a temporal pattern of length m. The representation (1) does not take into account any noise component, since it assumes that a deterministic process f can accurately describe the time series. Note, however, that this is only one possible way of representing the time series phenomenon and that alternative representations should not be discarded a priori. In fact, once we assume that we do not have access to an accurate model of the function f, it is reasonable to extend the deterministic formulation (1) to a statistical Nonlinear AutoRegressive (NAR) formulation [8]

y_{t+1} = f(y_{t-d}, y_{t-d-1}, ..., y_{t-d-m+1}) + w(t)    (2)
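In practice this embedding is what turns forecasting into a supervised learning problem: the observed series is unrolled into input/output pairs by sliding the lag vector along it. A minimal sketch (the helper name and its interface are ours, for illustration only, not code from the paper):

```python
import numpy as np

def embed(series, m, d=0, H=1):
    """Turn a univariate series into pairs (X_i, Y_i): X_i holds the m
    lagged values y_{t-d}, ..., y_{t-d-m+1}, Y_i the next H values.
    Hypothetical helper for illustration only."""
    series = np.asarray(series, dtype=float)
    X, Y = [], []
    # t is the "present time"; we need m lags behind and H values ahead
    for t in range(d + m - 1, len(series) - H):
        X.append(series[t - d - m + 1 : t - d + 1][::-1])  # most recent lag first
        Y.append(series[t + 1 : t + H + 1])
    return np.array(X), np.array(Y)

X, Y = embed(np.arange(20.0), m=3, d=0, H=2)
print(X[0], Y[0])  # [2. 1. 0.] [3. 4.]
```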
where the missing information is lumped into a noise term w. In the rest of the paper we will refer to the formulation (2) as a general representation of the time series, which includes the case (1) as a particular instance. The success of a reconstruction approach starting from a set of observed data depends on the choice of the hypothesis that approximates f and on the choice of the order m and of the lag time d. In this paper we will address only the problem of modeling f, assuming that the values of m and d are available a priori or selected by conventional model selection techniques. Good references on order selection are [7, 16].

A model of the mapping (2) can be used for two objectives: one-step prediction and iterated prediction. In the first case, the m previous values of the series are assumed to be available and the problem is equivalent to a problem of function estimation. In the case of iterated prediction, the predicted output is fed back as an input to the following prediction, so the inputs consist of predicted values as opposed to actual observations of the original time series. A prediction iterated H times returns an H-step-ahead forecast. Examples of iterated approaches are recurrent neural networks [17] and local learning iterated techniques [9, 12]. Another way to perform H-step-ahead forecasting is to have a model which returns a direct forecast at time t+h, h = 1, ..., H:

y_{t+h} = f_h(y_{t-d}, y_{t-d-1}, ..., y_{t-d-m+1})

Direct methods often require high functional complexity in order to emulate the system. In some cases the direct prediction method yields better results than the iterated one [16]. An example of a combination of local techniques of iterated and direct type is provided by Sauer [15]. Iterated and direct techniques for multi-step-ahead prediction share a common feature: from historical data they model a multi-input single-output mapping, whose output is the variable y_{t+1} in the iterated case and the variable y_{t+h} in the direct case.
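The iterated strategy fits a one-step model and feeds its own outputs back as inputs. A sketch, where `model` stands in for any learner trained on (lag vector, next value) pairs; the interface is our assumption:

```python
import numpy as np

def iterated_forecast(model, last_window, H):
    """Iterated H-step-ahead prediction: each one-step estimate is fed
    back as the most recent lag. `model` maps a lag vector
    [y_t, ..., y_{t-m+1}] to an estimate of y_{t+1} (assumed interface)."""
    window = list(last_window)           # most recent value first
    preds = []
    for _ in range(H):
        y_next = model(np.array(window))
        preds.append(y_next)
        window = [y_next] + window[:-1]  # prediction replaces the oldest lag
    return np.array(preds)

# toy one-step model: for the series y_t = t, the next value is y_t + 1
preds = iterated_forecast(lambda w: w[0] + 1.0, [10.0, 9.0, 8.0], H=3)
print(preds)  # [11. 12. 13.]
```

Note the risk this makes explicit: after the first step the model is evaluated on its own noisy outputs rather than on observed values.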
This paper advocates that when a very long term prediction is at stake and a stochastic setting is assumed, the modeling of a single-output mapping neglects the existence of stochastic dependencies between future values (e.g. between y_{t+h} and y_{t+h+1}) and consequently biases the prediction accuracy. A possible way to remedy this shortcoming is to move from the modeling of a single-output mapping to the modeling of multi-output dependencies. This requires the adoption of a multi-output technique, where the predicted value is no longer a scalar quantity but a vector of future values of the time series. When there are multiple outputs it is common, apart from some exceptions [11], to treat the prediction problem as a set of independent problems, one per output. Unfortunately this is not effective if the output noises are correlated, as is the case in a time series. The contribution of this paper is a simple extension of the Lazy Learning paradigm to the multi-output setting [5, 2]. Lazy Learning (LL) is a local modeling technique which is query-based, in the sense that the whole learning procedure (i.e. structural and parametric identification) is deferred until a prediction is required. In previous works we presented an original Lazy Learning algorithm [5, 2] that automatically selects, on a query-by-query basis, the optimal number of neighbors. Iterated versions of Lazy Learning were successfully applied to multi-step-ahead time series prediction [4, 6]. This
paper presents instead a multi-output version of LL for the prediction of multiple and dependent outputs in the context of long term prediction.

2 Multi-step-ahead and multi-output models

Let us consider a stochastic time series of dimension m described by the stochastic dependency

y_{t+1} = f(y_{t-d}, y_{t-d-1}, ..., y_{t-d-m+1}) + w(t) = f(X) + w(t)    (3)

where w is a zero-mean noise term and X denotes the lag vector X = {y_{t-d}, y_{t-d-1}, ..., y_{t-d-m+1}}. Suppose we have measured the series up to time t and that we intend to forecast the next H, H >= 1, values. The problem of predicting the next H values boils down to the estimation of the distribution of the H-dimensional random vector Y = {y_{t+1}, ..., y_{t+H}} conditional on the value of X. In other terms, the stochastic dependency (2) between a future value of the time series and the past observed values X induces the existence of a multivariate conditional probability p(Y|X), where Y ∈ R^H and X ∈ R^m. This distribution can be highly complex in the case of a large dimensionality m of the series and a long prediction horizon H. An easy way to visualize and reason about this complex conditional distribution is a probabilistic graphical model. Probabilistic graphical models [10] are graphs in which nodes represent random variables and the lack of arcs represents conditional independence assumptions. For instance, the probabilistic dependencies which characterize a multi-step-ahead prediction problem for a time series of dimension m = 2, lag time d = 0 and horizon H = 3 can be represented by the graphical model in Figure 1. Note that in this figure X = {y_t, y_{t-1}} and Y = {y_{t+1}, y_{t+2}, y_{t+3}}. The graph shows that y_{t-1} has a direct influence on y_{t+1} but only an indirect influence on y_{t+2}. At the same time, y_{t+1} and y_{t+3} are not conditionally independent given the vector X = {y_t, y_{t-1}}. Any forecasting method which aims to perform multi-step-ahead prediction implements (often in an implicit manner) an estimator of the highly multivariate conditional distribution p(Y|X).
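The dependence between future values given X, which single-output methods ignore, is easy to exhibit numerically. The following Monte Carlo sketch (our own toy example, with m = 1 and a linear f, not an experiment from the paper) shows that y_{t+1} and y_{t+2} remain strongly correlated even after conditioning on the lag vector:

```python
import numpy as np

# Simulate many continuations of the NAR series y_{t+1} = f(y_t) + w(t)
# from the same conditioning value y_t = x, then measure the residual
# correlation between the two future steps.
rng = np.random.default_rng(0)
f = lambda y: 0.9 * y                    # toy f; noise std 0.5
x = 1.0                                  # observed lag vector X = {y_t} = {1.0}
n = 100_000
y1 = f(x) + 0.5 * rng.normal(size=n)     # draws of y_{t+1} | X
y2 = f(y1) + 0.5 * rng.normal(size=n)    # draws of y_{t+2} | X
rho = np.corrcoef(y1, y2)[0, 1]
print(rho > 0.5)  # True: far from conditionally independent
```

For this linear toy case the theoretical conditional correlation is about 0.67, which is what the simulation recovers.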
The graphical model representation can help in visualizing the differences between the two most common multi-step-ahead approaches, the iterated one and the direct one. The iterated prediction approach replaces the unknown random variables {y_{t+1}, ..., y_{t+H-1}} with their estimates {ŷ_{t+1}, ..., ŷ_{t+H-1}}. In graphical terms this method models an approximation (Figure 2) of the real conditional distribution, in which the topology of conditional dependencies is preserved though non-observable variables are replaced by their noisy estimators. The direct prediction approach transforms the problem of modeling the multivariate distribution p(Y|X) into H distinct and parallel problems, in which the target conditional distribution is p(y_{t+h}|X), h = 1, ..., H. The topology of the dependencies of the original conditional distribution is then altered, as shown in Figure 3. Note
Fig. 1: Graphical model representation of the conditional distribution p(Y|X) for H = 3, m = 2, d = 0.

Fig. 2: Graphical model representation of the distribution modeled by the iterated approach in the H = 3, m = 2, d = 0 prediction problem.
Fig. 3: Graphical model representation of the distribution modeled by the direct approach in the H = 3, m = 2, d = 0 prediction problem.

that in graphical model terminology this is equivalent to making the conditional independence assumption

p(Y|X) = p({y_{t+1}, ..., y_{t+H}} | X) = Π_{h=1}^{H} p(y_{t+h}|X)

Such an assumption is well known in the machine learning literature, since it is exploited by the Naive Bayes classifier to simplify multivariate classification problems. Figures 2 and 3 visualize the disadvantages associated with the adoption of the iterated and the direct method, respectively. Iterated methods may suffer from low performance in long-horizon tasks. This is due to the fact that they are essentially models tuned with a one-step-ahead criterion and are therefore not able to take the temporal behavior into account. In terms of the bias/variance decomposition, we can say that the iterated approach returns an unbiased estimator of the conditional distribution p(Y|X), since it preserves the dependencies between the components of the vector Y, though it suffers from high variance because of the propagation and amplification of the prediction error. On the other side, direct methods, by making an assumption of conditional independence, neglect complex dependency patterns existing between the variables in Y and consequently return a biased estimator of the multivariate distribution p(Y|X). In order to overcome these shortcomings, this paper proposes a multi-input multi-output approach in which the modeling procedure no longer targets single-output mappings (like y_{t+1} = f(X) + w or y_{t+h} = f_h(X) + w) but the multi-output mapping

Y = F(X) + W

where F: R^m -> R^H and the covariance of the noise vector W is not necessarily diagonal [11]. The multi-output model is expected to return a multivariate estimation of the joint distribution p(Y|X) and, by taking into account the dependencies between the components of Y, to reduce the bias of the direct estimator. However, it is worth noting that, in the case of a large forecasting
horizon H, the dimensionality of Y is large too, and the multivariate estimation could be vulnerable to large variance. A possible countermeasure to such a side effect is the adoption of combination strategies, which are well reputed to reduce variance in the case of low-bias estimators. The idea of combining predictors is well known in the time series literature [15]. What is original here is that a multi-output approach makes a large number of estimators available once the prediction horizon H is long. Think for example of the case where H = 20 and we want to estimate the value y_{t+10}. A simple way to make such an estimate more robust and accurate is to compute and combine several long term estimators which have a horizon larger than 10 (e.g. all the predictors with horizon between 10 and 20). For multi-output prediction problems the availability of learning algorithms is much more limited than in the single-output case [11]. Most existing approaches do what is actually done by the direct approach, that is, they decompose the problem into several multi-input single-output problems by making the assumption of conditional independence. What we propose here is to remove this assumption by using a multivariate estimation of the conditional distribution. For this purpose we adopt a nearest neighbor estimation approach, in which the problem of adjusting the size of the neighborhood (bandwidth) is solved by a strategy successfully adopted in our previous work on the Lazy Learning algorithm [5, 2].

3 A locally constant method for multi-output regression

We discuss here a locally constant multi-output regression method to implement a multi-step-ahead predictor. The idea is to return, instead of a scalar, a vector which smooths the continuations of the trajectories which, at time t, most resemble the trajectory X. This method is a multi-output extension of the Lazy Learning algorithm [5, 2] and is referred to as LL-MIMO.
The adoption of a local approach to solve a prediction task requires the definition of a set of model parameters (e.g. the number of neighbors, the kernel function, the parametric family, the distance metric). In the local learning literature different methods exist to select the adequate configuration automatically [1, 2] by adopting tools and techniques from the field of linear statistical analysis. One of these tools is the PRESS statistic, which is a simple, well-founded and economical way to perform leave-one-out (l-o-o) cross-validation and to assess the generalization performance of local linear models. By assessing the performance of each local model, alternative configurations can be tested and compared in order to select the best one in terms of expected prediction. This is known as the winner-takes-all approach to model selection. An alternative to the winner-takes-all approach was proposed in [5, 2] and consists in combining several local models, using the PRESS leave-one-out error to weight the contribution of each term. This appears to be particularly effective in large-variance settings [3], as is presumably the case in a stochastic multi-step-ahead task. LL-MIMO extends the bandwidth combination strategy to the multi-output case, where H denotes both the horizon of the long term prediction and the number of outputs. What we propose is a combination of local approximators with different bandwidths, where the weighting criterion depends on the multiple
step leave-one-out errors e_h, h = 1, ..., H, computed over the horizon H. In order to apply local learning to time series forecasting, the time series is embedded into a dataset D_N made of N pairs (X_i, Y_i), where X_i is a temporal pattern of length m and the vector Y_i is the consecutive temporal pattern of length H. Suppose the series is measured up to time t and assume for simplicity that the lag d = 0. Let X = {y_t, ..., y_{t-m+1}} denote the lag embedding vector at time t. Given a metric on the space R^m, let us order the set of vectors X_i increasingly with respect to their distance to X, and denote by [j] the index of the j-th closest neighbor of X. For a given number k of neighbors, the H-step prediction is a vector whose h-th component is the average

Ŷ_h^k = (1/k) Σ_{j=1}^{k} Y_h^{[j]}

where Y^{[j]} is the output vector of the j-th closest neighbor of X in the training set D_N. We can associate to the estimation Ŷ^k a multi-step leave-one-out error

E_k = (1/H) Σ_{h=1}^{H} e_h^2

where e_h is the leave-one-out error of the constant model used to approximate the output at step h. In the case of a constant model the l-o-o residual of the j-th neighbor is easy to derive [3]:

e_h^{[j]} = k (Y_h^{[j]} - Ŷ_h^k) / (k - 1)

Though the optimal number of neighbors k is not known a priori, in [5, 2] we showed that an effective strategy consists in (i) allowing k to vary in a set {k_1, ..., k_b} and (ii) returning a prediction which is a combination of the predictions Ŷ^{k_i} for each bandwidth k_i, i = 1, ..., b. If we adopt as combination strategy the generalized ensemble method proposed in [14], the outcome of the LL-MIMO algorithm is a vector of size H whose h-th term is

ŷ_{t+h} = Ŷ_h = ( Σ_{i=1}^{b} ζ_i Ŷ_h^{k_i} ) / ( Σ_{i=1}^{b} ζ_i ),   h = 1, ..., H    (4)

where the weights are the inverses of the multiple-step l-o-o mean square errors: ζ_i = 1/E_{k_i}.

4 Experiments and final considerations

The LL-MIMO approach has been tested by applying it to the prediction of the three time series from the ESTSP 08 Competition.
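Before looking at the results, the whole LL-MIMO predictor of eq. (4) can be sketched compactly. How E_k averages the leave-one-out residuals over neighbors as well as over the horizon is our reading of the derivation above, so treat that detail as an assumption rather than the paper's exact implementation:

```python
import numpy as np

def ll_mimo(X_train, Y_train, x_query, k_values=(5, 7, 9)):
    """Locally constant multi-output prediction: for each bandwidth k,
    average the H-dimensional outputs of the k nearest neighbors, then
    combine the bandwidths with weights zeta = 1 / E_k as in eq. (4)."""
    order = np.argsort(np.linalg.norm(X_train - x_query, axis=1))
    preds, weights = [], []
    for k in k_values:
        Yk = Y_train[order[:k]]            # outputs of the k nearest neighbors
        Y_hat = Yk.mean(axis=0)            # locally constant H-vector prediction
        loo = k * (Yk - Y_hat) / (k - 1)   # l-o-o residuals of a constant model
        E_k = np.mean(loo ** 2)            # multi-step l-o-o mean squared error
        preds.append(Y_hat)
        weights.append(1.0 / E_k)
    w = np.array(weights)
    return (w[:, None] * np.array(preds)).sum(axis=0) / w.sum()

rng = np.random.default_rng(1)
X_tr = rng.normal(size=(40, 3))
Y_tr = np.stack([X_tr[:, 0], X_tr[:, 0] + X_tr[:, 1]], axis=1)
Y_tr = Y_tr + 0.01 * rng.normal(size=Y_tr.shape)   # toy dataset with H = 2
y_hat = ll_mimo(X_tr, Y_tr, X_tr[0])
print(y_hat.shape)  # (2,)
```

The returned vector is the full H-step forecast, so the dependencies between outputs enter through the shared neighborhood rather than through H separately tuned models.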
The first time series (ESTSP1) has a training set of 354 three-dimensional vectors and the task is to predict the continuation of the third variable for H = 18 steps. The second time
series (ESTSP2) has a training set of 1300 values and the task is to predict the continuation for H = 100 steps. The third time series (ESTSP3) has a training set of values and the task is to predict the continuation for H = 200 steps. The experimental session aims to compare the following methods on a long term prediction task: (i) a conventional iterated approach, (ii) a direct approach, (iii) a multi-output LL-MIMO approach, (iv) a combination of several LL-MIMO predictors (denoted by LL-MIMO-COMB), and (v) a combination of the LL-MIMO and the iterated approach (denoted by LL-MIMO-IT). In the strategy LL-MIMO-COMB the prediction at time t+h is

ŷ_{t+h} = ( Σ_{H_j = h}^{H} Ŷ_h^{(H_j)} ) / (H - h + 1)

where Ŷ_h^{(H_j)} is the prediction of a multi-output LL-MIMO with horizon H_j. In the strategy LL-MIMO-IT the prediction is the average

ŷ_{t+h} = ( Ŷ_h^{(H)} + Ŷ_h^{it} ) / 2

where Ŷ^{it} is the prediction returned by an iterated scheme. The rationale behind these two averaging methods is the reduction of the variance, as discussed at the end of Section 2. Note that in all the considered techniques the learner is implemented by the same local learning technique, which combines a set of constant models whose numbers of neighbors range in the same interval [5, k_b], with k_b a parameter of the algorithm. In order to perform a correct comparison, all the techniques are tested under the same conditions in terms of test intervals, embedding order m, values of k_b and lag time d.
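The LL-MIMO-COMB average is straightforward once a family of multi-output predictors is available. In the sketch below, `predict` is a hypothetical callable standing in for LL-MIMO run with horizon H_j:

```python
import numpy as np

def ll_mimo_comb(predict, h, H):
    """Average the h-th component (1-based) of every multi-output
    predictor whose horizon H_j satisfies h <= H_j <= H."""
    vals = [predict(Hj)[h - 1] for Hj in range(h, H + 1)]
    return sum(vals) / (H - h + 1)

# toy forecaster: the Hj-horizon prediction of step i is i + 0.1 * Hj
toy = lambda Hj: np.array([i + 0.1 * Hj for i in range(1, Hj + 1)])
print(ll_mimo_comb(toy, h=2, H=4))
```

Each averaged predictor shares the low bias of LL-MIMO, so the combination mainly buys a reduction in variance, as argued at the end of Section 2.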
In detail: the series ESTSP1 is used to assess the five techniques on the last portion of the training set of size H = 18, for values of m ranging from 5 to 20, values of d ranging from 0 to 1 and k_b ranging from 10 to 25; the series ESTSP2 is used to assess the five techniques on the last portion of the training set of size H = 100, for values of m ranging from 5 to 35, values of d ranging from 0 to 1 and k_b ranging from 10 to 25; the series ESTSP3 is used to assess the five techniques on the last portion of the training set of size H = 200, for m ∈ {20, 50, 80, ..., 200}, values of d ranging from 0 to 2 and k_b ranging from 10 to 15. Table 1 compares the average NMSE (Normalized Mean Squared Error, where the normalization is done with respect to the variance of the entire series) prediction errors of the five techniques for the three datasets. The bold notation designates the technique which is significantly better than all the others (at the 0.05 significance level of the permutation test). Table 2 compares the minimum of the NMSE prediction errors attained by the five techniques over all the different configurations in terms of dimension m, lag time d and number of neighbors k_b. The experimental results show that for long term prediction tasks the LL-MIMO-COMB and LL-MIMO-IT strategies, i.e. the averaging formulations
Table 1: Average NMSE of the predictions for the three time series. The bold notation stands for significantly better than all the others at the 0.05 significance level of the paired permutation test.

Test data   LL-IT   LL-DIR   LL-MIMO   LL-MIMO-COMB   LL-MIMO-IT
ESTSP1
ESTSP2
ESTSP3      1.63e e e e e-2

Table 2: Minimum NMSE of the predictions for the three time series.

Test data   LL-IT   LL-DIR   LL-MIMO   LL-MIMO-COMB   LL-MIMO-IT
ESTSP1
ESTSP2
ESTSP3      1.00e e e e e-2

of the LL-MIMO algorithm, can outperform conventional direct and iterated methods. LL-MIMO alone does not emerge as a competitive algorithm, probably because of the excessive variance induced by the large dimensionality. The low-bias nature of LL-MIMO, however, makes this approach a good candidate for averaging, as demonstrated by the good performance of LL-MIMO-COMB and LL-MIMO-IT. On the basis of these experiences we decided to submit to the Competition the LL-MIMO-IT prediction of the continuation of ESTSP2, and the LL-MIMO-COMB predictions of the continuations of ESTSP1 and ESTSP3. A plot of the LL-MIMO-COMB prediction on the last portion of ESTSP3 is shown in Figure 4. We hope that the final validation provided by the Competition continuation series will confirm the importance of multi-output strategies in long term time series forecasting.

References

[1] C. G. Atkeson, A. W. Moore, and S. Schaal. Locally weighted learning. Artificial Intelligence Review, 11(1-5):11-73.
[2] M. Birattari, G. Bontempi, and H. Bersini. Lazy learning meets the recursive least-squares algorithm. In M. S. Kearns, S. A. Solla, and D. A. Cohn, editors, NIPS 11, Cambridge, MA: MIT Press.
[3] G. Bontempi. Local Learning Techniques for Modeling, Prediction and Control. PhD thesis, IRIDIA, Université Libre de Bruxelles.
[4] G. Bontempi, M. Birattari, and H. Bersini. Lazy learning for iterated time series prediction. In J. A. K. Suykens and J.
Vandewalle, editors, Proceedings of the International Workshop on Advanced Black-Box Techniques for Nonlinear Modeling, Katholieke Universiteit Leuven, Belgium, 1998.
Fig. 4: ESTSP3: time series (line) vs. LL-MIMO-COMB prediction (dots).

[5] G. Bontempi, M. Birattari, and H. Bersini. Lazy learning for modeling and control design. International Journal of Control, 72(7/8).
[6] G. Bontempi, M. Birattari, and H. Bersini. Local learning for iterated time-series prediction. In I. Bratko and S. Dzeroski, editors, Machine Learning: Proceedings of the Sixteenth International Conference, pages 32-38, San Francisco, CA: Morgan Kaufmann Publishers.
[7] M. Casdagli, S. Eubank, J. D. Farmer, and J. Gibson. State space reconstruction in the presence of noise. Physica D, 51:52-98.
[8] J. Fan and Q. Yao. Nonlinear Time Series. Springer.
[9] J. D. Farmer and J. J. Sidorowich. Predicting chaotic time series. Physical Review Letters, 59(8).
[10] F. V. Jensen. Bayesian Networks and Decision Graphs. Springer.
[11] J. M. Matias. Multi-output nonparametric regression. In Progress in Artificial Intelligence.
[12] J. McNames, J. Suykens, and J. Vandewalle. Winning contribution of the K.U. Leuven time-series prediction competition. International Journal of Bifurcation and Chaos, to appear.
[13] N. H. Packard, J. P. Crutchfield, J. D. Farmer, and R. S. Shaw. Geometry from a time series. Physical Review Letters, 45(9).
[14] M. P. Perrone and L. N. Cooper. When networks disagree: Ensemble methods for hybrid neural networks. In R. J. Mammone, editor, Artificial Neural Networks for Speech and Vision. Chapman and Hall.
[15] T. Sauer. Time series prediction by using delay coordinate embedding. In A. S. Weigend and N. A. Gershenfeld, editors, Time Series Prediction: Forecasting the Future and Understanding the Past. Addison Wesley, Harlow, UK.
[16] A. Sorjamaa, J. Hao, N. Reyhani, Y. Ji, and A. Lendasse. Methodology for long-term prediction of time series. Neurocomputing.
[17] R. Williams and D. Zipser. A learning algorithm for continually running fully recurrent neural networks. Neural Computation, 1, 1989.
Continuity and Differentiability Workseet (Be sure tat you can also do te grapical eercises from te tet- Tese were not included below! Typical problems are like problems -3, p. 6; -3, p. 7; 33-34, p. 7;
More informationHomework 1 Due: Wednesday, September 28, 2016
0-704 Information Processing and Learning Fall 06 Homework Due: Wednesday, September 8, 06 Notes: For positive integers k, [k] := {,..., k} denotes te set of te first k positive integers. Wen p and Y q
More information1. Questions (a) through (e) refer to the graph of the function f given below. (A) 0 (B) 1 (C) 2 (D) 4 (E) does not exist
Mat 1120 Calculus Test 2. October 18, 2001 Your name Te multiple coice problems count 4 points eac. In te multiple coice section, circle te correct coice (or coices). You must sow your work on te oter
More informationSymmetry Labeling of Molecular Energies
Capter 7. Symmetry Labeling of Molecular Energies Notes: Most of te material presented in tis capter is taken from Bunker and Jensen 1998, Cap. 6, and Bunker and Jensen 2005, Cap. 7. 7.1 Hamiltonian Symmetry
More informationBounds on the Moments for an Ensemble of Random Decision Trees
Noname manuscript No. (will be inserted by te editor) Bounds on te Moments for an Ensemble of Random Decision Trees Amit Durandar Received: / Accepted: Abstract An ensemble of random decision trees is
More informationCopyright c 2008 Kevin Long
Lecture 4 Numerical solution of initial value problems Te metods you ve learned so far ave obtained closed-form solutions to initial value problems. A closedform solution is an explicit algebriac formula
More informationBootstrap confidence intervals in nonparametric regression without an additive model
Bootstrap confidence intervals in nonparametric regression witout an additive model Dimitris N. Politis Abstract Te problem of confidence interval construction in nonparametric regression via te bootstrap
More informationIntuition Bayesian Classification
Intuition Bayesian Classification More ockey fans in Canada tan in US Wic country is Tom, a ockey ball fan, from? Predicting Canada as a better cance to be rigt Prior probability P(Canadian=5%: reflect
More information1 The concept of limits (p.217 p.229, p.242 p.249, p.255 p.256) 1.1 Limits Consider the function determined by the formula 3. x since at this point
MA00 Capter 6 Calculus and Basic Linear Algebra I Limits, Continuity and Differentiability Te concept of its (p.7 p.9, p.4 p.49, p.55 p.56). Limits Consider te function determined by te formula f Note
More informationContinuous Stochastic Processes
Continuous Stocastic Processes Te term stocastic is often applied to penomena tat vary in time, wile te word random is reserved for penomena tat vary in space. Apart from tis distinction, te modelling
More informationOnline Learning: Bandit Setting
Online Learning: Bandit Setting Daniel asabi Summer 04 Last Update: October 0, 06 Introduction [TODO Bandits. Stocastic setting Suppose tere exists unknown distributions ν,..., ν, suc tat te loss at eac
More informationCORRELATION TEST OF RESIDUAL ERRORS IN FREQUENCY DOMAIN SYSTEM IDENTIFICATION
IAC Symposium on System Identification, SYSID 006 Marc 9-3 006, Newcastle, Australia CORRELATION TEST O RESIDUAL ERRORS IN REQUENCY DOMAIN SYSTEM IDENTIICATION István Kollár *, Ri Pintelon **, Joan Scouens
More information3.1 Extreme Values of a Function
.1 Etreme Values of a Function Section.1 Notes Page 1 One application of te derivative is finding minimum and maimum values off a grap. In precalculus we were only able to do tis wit quadratics by find
More informationHandling Missing Data on Asymmetric Distribution
International Matematical Forum, Vol. 8, 03, no. 4, 53-65 Handling Missing Data on Asymmetric Distribution Amad M. H. Al-Kazale Department of Matematics, Faculty of Science Al-albayt University, Al-Mafraq-Jordan
More informationEnsembles of Nearest Neighbor Forecasts
Ensembles of Nearest Neighbor Forecasts Dragomir Yankov 1, Dennis DeCoste 2, and Eamonn Keogh 1 1 University of California, Riverside CA 92507, USA, {dyankov,eamonn}@cs.ucr.edu, 2 Yahoo! Research, 3333
More informationAn Empirical Bayesian interpretation and generalization of NL-means
Computer Science Tecnical Report TR2010-934, October 2010 Courant Institute of Matematical Sciences, New York University ttp://cs.nyu.edu/web/researc/tecreports/reports.tml An Empirical Bayesian interpretation
More informationFundamentals of Concept Learning
Aims 09s: COMP947 Macine Learning and Data Mining Fundamentals of Concept Learning Marc, 009 Acknowledgement: Material derived from slides for te book Macine Learning, Tom Mitcell, McGraw-Hill, 997 ttp://www-.cs.cmu.edu/~tom/mlbook.tml
More informationBootstrap prediction intervals for Markov processes
arxiv: arxiv:0000.0000 Bootstrap prediction intervals for Markov processes Li Pan and Dimitris N. Politis Li Pan Department of Matematics University of California San Diego La Jolla, CA 92093-0112, USA
More informationNatural Language Understanding. Recap: probability, language models, and feedforward networks. Lecture 12: Recurrent Neural Networks and LSTMs
Natural Language Understanding Lecture 12: Recurrent Neural Networks and LSTMs Recap: probability, language models, and feedforward networks Simple Recurrent Networks Adam Lopez Credits: Mirella Lapata
More information2.1 THE DEFINITION OF DERIVATIVE
2.1 Te Derivative Contemporary Calculus 2.1 THE DEFINITION OF DERIVATIVE 1 Te grapical idea of a slope of a tangent line is very useful, but for some uses we need a more algebraic definition of te derivative
More informationAnalysis of Solar Generation and Weather Data in Smart Grid with Simultaneous Inference of Nonlinear Time Series
Te First International Worksop on Smart Cities and Urban Informatics 215 Analysis of Solar Generation and Weater Data in Smart Grid wit Simultaneous Inference of Nonlinear Time Series Yu Wang, Guanqun
More informationFlavius Guiaş. X(t + h) = X(t) + F (X(s)) ds.
Numerical solvers for large systems of ordinary differential equations based on te stocastic direct simulation metod improved by te and Runge Kutta principles Flavius Guiaş Abstract We present a numerical
More informationMinimizing D(Q,P) def = Q(h)
Inference Lecture 20: Variational Metods Kevin Murpy 29 November 2004 Inference means computing P( i v), were are te idden variables v are te visible variables. For discrete (eg binary) idden nodes, exact
More informationConsider a function f we ll specify which assumptions we need to make about it in a minute. Let us reformulate the integral. 1 f(x) dx.
Capter 2 Integrals as sums and derivatives as differences We now switc to te simplest metods for integrating or differentiating a function from its function samples. A careful study of Taylor expansions
More informationPreface. Here are a couple of warnings to my students who may be here to get a copy of what happened on a day that you missed.
Preface Here are my online notes for my course tat I teac ere at Lamar University. Despite te fact tat tese are my class notes, tey sould be accessible to anyone wanting to learn or needing a refreser
More informationProbabilistic Graphical Models Homework 1: Due January 29, 2014 at 4 pm
Probabilistic Grapical Models 10-708 Homework 1: Due January 29, 2014 at 4 pm Directions. Tis omework assignment covers te material presented in Lectures 1-3. You must complete all four problems to obtain
More informationEFFICIENCY OF MODEL-ASSISTED REGRESSION ESTIMATORS IN SAMPLE SURVEYS
Statistica Sinica 24 2014, 395-414 doi:ttp://dx.doi.org/10.5705/ss.2012.064 EFFICIENCY OF MODEL-ASSISTED REGRESSION ESTIMATORS IN SAMPLE SURVEYS Jun Sao 1,2 and Seng Wang 3 1 East Cina Normal University,
More informationPOLYNOMIAL AND SPLINE ESTIMATORS OF THE DISTRIBUTION FUNCTION WITH PRESCRIBED ACCURACY
APPLICATIONES MATHEMATICAE 36, (29), pp. 2 Zbigniew Ciesielski (Sopot) Ryszard Zieliński (Warszawa) POLYNOMIAL AND SPLINE ESTIMATORS OF THE DISTRIBUTION FUNCTION WITH PRESCRIBED ACCURACY Abstract. Dvoretzky
More informationNew Streamfunction Approach for Magnetohydrodynamics
New Streamfunction Approac for Magnetoydrodynamics Kab Seo Kang Brooaven National Laboratory, Computational Science Center, Building 63, Room, Upton NY 973, USA. sang@bnl.gov Summary. We apply te finite
More informationOn the Identifiability of the Post-Nonlinear Causal Model
UAI 9 ZHANG & HYVARINEN 647 On te Identifiability of te Post-Nonlinear Causal Model Kun Zang Dept. of Computer Science and HIIT University of Helsinki Finland Aapo Hyvärinen Dept. of Computer Science,
More informationNew Distribution Theory for the Estimation of Structural Break Point in Mean
New Distribution Teory for te Estimation of Structural Break Point in Mean Liang Jiang Singapore Management University Xiaou Wang Te Cinese University of Hong Kong Jun Yu Singapore Management University
More informationMath 1241 Calculus Test 1
February 4, 2004 Name Te first nine problems count 6 points eac and te final seven count as marked. Tere are 120 points available on tis test. Multiple coice section. Circle te correct coice(s). You do
More informationTwo Spirals Two Gaussians Letters
12 1 8 6 4 2 Two Spirals Two Gaussians Letters Figure 8: Number of examples needed for average error to reac.3. From left to rigt: random, uncertainty, maximal distance and lookaead sampling metods. contains
More information1. Which one of the following expressions is not equal to all the others? 1 C. 1 D. 25x. 2. Simplify this expression as much as possible.
004 Algebra Pretest answers and scoring Part A. Multiple coice questions. Directions: Circle te letter ( A, B, C, D, or E ) net to te correct answer. points eac, no partial credit. Wic one of te following
More informationOn Local Linear Regression Estimation of Finite Population Totals in Model Based Surveys
American Journal of Teoretical and Applied Statistics 2018; 7(3): 92-101 ttp://www.sciencepublisinggroup.com/j/ajtas doi: 10.11648/j.ajtas.20180703.11 ISSN: 2326-8999 (Print); ISSN: 2326-9006 (Online)
More informationOverdispersed Variational Autoencoders
Overdispersed Variational Autoencoders Harsil Sa, David Barber and Aleksandar Botev Department of Computer Science, University College London Alan Turing Institute arsil.sa.15@ucl.ac.uk, david.barber@ucl.ac.uk,
More informationTime (hours) Morphine sulfate (mg)
Mat Xa Fall 2002 Review Notes Limits and Definition of Derivative Important Information: 1 According to te most recent information from te Registrar, te Xa final exam will be eld from 9:15 am to 12:15
More informationArtificial Neural Network Model Based Estimation of Finite Population Total
International Journal of Science and Researc (IJSR), India Online ISSN: 2319-7064 Artificial Neural Network Model Based Estimation of Finite Population Total Robert Kasisi 1, Romanus O. Odiambo 2, Antony
More informationDeep Belief Network Training Improvement Using Elite Samples Minimizing Free Energy
Deep Belief Network Training Improvement Using Elite Samples Minimizing Free Energy Moammad Ali Keyvanrad a, Moammad Medi Homayounpour a a Laboratory for Intelligent Multimedia Processing (LIMP), Computer
More informationLearning to Reject Sequential Importance Steps for Continuous-Time Bayesian Networks
Learning to Reject Sequential Importance Steps for Continuous-Time Bayesian Networks Jeremy C. Weiss University of Wisconsin-Madison Madison, WI, US jcweiss@cs.wisc.edu Sriraam Natarajan Indiana University
More informationA Jump-Preserving Curve Fitting Procedure Based On Local Piecewise-Linear Kernel Estimation
A Jump-Preserving Curve Fitting Procedure Based On Local Piecewise-Linear Kernel Estimation Peiua Qiu Scool of Statistics University of Minnesota 313 Ford Hall 224 Curc St SE Minneapolis, MN 55455 Abstract
More informationLearning based super-resolution land cover mapping
earning based super-resolution land cover mapping Feng ing, Yiang Zang, Giles M. Foody IEEE Fellow, Xiaodong Xiuua Zang, Siming Fang, Wenbo Yun Du is work was supported in part by te National Basic Researc
More informationThe Verlet Algorithm for Molecular Dynamics Simulations
Cemistry 380.37 Fall 2015 Dr. Jean M. Standard November 9, 2015 Te Verlet Algoritm for Molecular Dynamics Simulations Equations of motion For a many-body system consisting of N particles, Newton's classical
More informationDifferentiation in higher dimensions
Capter 2 Differentiation in iger dimensions 2.1 Te Total Derivative Recall tat if f : R R is a 1-variable function, and a R, we say tat f is differentiable at x = a if and only if te ratio f(a+) f(a) tends
More informationSolution for the Homework 4
Solution for te Homework 4 Problem 6.5: In tis section we computed te single-particle translational partition function, tr, by summing over all definite-energy wavefunctions. An alternative approac, owever,
More informationFast optimal bandwidth selection for kernel density estimation
Fast optimal bandwidt selection for kernel density estimation Vikas Candrakant Raykar and Ramani Duraiswami Dept of computer science and UMIACS, University of Maryland, CollegePark {vikas,ramani}@csumdedu
More informationMulti-User Communication: Capacity, Duality, and Cooperation. Nihar Jindal Stanford University Dept. of Electrical Engineering
Multi-User Communication: Capacity, Duality, and Cooperation Niar Jindal Stanford University Dept. of Electrical Engineering February 3, 004 Wireless Communication Vision Cellular Networks Wireless LAN
More informationAverage Rate of Change
Te Derivative Tis can be tougt of as an attempt to draw a parallel (pysically and metaporically) between a line and a curve, applying te concept of slope to someting tat isn't actually straigt. Te slope
More informationBob Brown Math 251 Calculus 1 Chapter 3, Section 1 Completed 1 CCBC Dundalk
Bob Brown Mat 251 Calculus 1 Capter 3, Section 1 Completed 1 Te Tangent Line Problem Te idea of a tangent line first arises in geometry in te context of a circle. But before we jump into a discussion of
More informationHOW TO DEAL WITH FFT SAMPLING INFLUENCES ON ADEV CALCULATIONS
HOW TO DEAL WITH FFT SAMPLING INFLUENCES ON ADEV CALCULATIONS Po-Ceng Cang National Standard Time & Frequency Lab., TL, Taiwan 1, Lane 551, Min-Tsu Road, Sec. 5, Yang-Mei, Taoyuan, Taiwan 36 Tel: 886 3
More informationSECTION 1.10: DIFFERENCE QUOTIENTS LEARNING OBJECTIVES
(Section.0: Difference Quotients).0. SECTION.0: DIFFERENCE QUOTIENTS LEARNING OBJECTIVES Define average rate of cange (and average velocity) algebraically and grapically. Be able to identify, construct,
More informationLecture XVII. Abstract We introduce the concept of directional derivative of a scalar function and discuss its relation with the gradient operator.
Lecture XVII Abstract We introduce te concept of directional derivative of a scalar function and discuss its relation wit te gradient operator. Directional derivative and gradient Te directional derivative
More informationA MONTE CARLO ANALYSIS OF THE EFFECTS OF COVARIANCE ON PROPAGATED UNCERTAINTIES
A MONTE CARLO ANALYSIS OF THE EFFECTS OF COVARIANCE ON PROPAGATED UNCERTAINTIES Ronald Ainswort Hart Scientific, American Fork UT, USA ABSTRACT Reports of calibration typically provide total combined uncertainties
More informationTHE IDEA OF DIFFERENTIABILITY FOR FUNCTIONS OF SEVERAL VARIABLES Math 225
THE IDEA OF DIFFERENTIABILITY FOR FUNCTIONS OF SEVERAL VARIABLES Mat 225 As we ave seen, te definition of derivative for a Mat 111 function g : R R and for acurveγ : R E n are te same, except for interpretation:
More informationEDML: A Method for Learning Parameters in Bayesian Networks
: A Metod for Learning Parameters in Bayesian Networks Artur Coi, Kaled S. Refaat and Adnan Darwice Computer Science Department University of California, Los Angeles {aycoi, krefaat, darwice}@cs.ucla.edu
More informationPre-Calculus Review Preemptive Strike
Pre-Calculus Review Preemptive Strike Attaced are some notes and one assignment wit tree parts. Tese are due on te day tat we start te pre-calculus review. I strongly suggest reading troug te notes torougly
More informationarxiv: v1 [math.oc] 18 May 2018
Derivative-Free Optimization Algoritms based on Non-Commutative Maps * Jan Feiling,, Amelie Zeller, and Cristian Ebenbauer arxiv:805.0748v [mat.oc] 8 May 08 Institute for Systems Teory and Automatic Control,
More informationAdaptive Neural Filters with Fixed Weights
Adaptive Neural Filters wit Fixed Weigts James T. Lo and Justin Nave Department of Matematics and Statistics University of Maryland Baltimore County Baltimore, MD 150, U.S.A. e-mail: jameslo@umbc.edu Abstract
More informationChapter 2 Ising Model for Ferromagnetism
Capter Ising Model for Ferromagnetism Abstract Tis capter presents te Ising model for ferromagnetism, wic is a standard simple model of a pase transition. Using te approximation of mean-field teory, te
More informationFunction Composition and Chain Rules
Function Composition and s James K. Peterson Department of Biological Sciences and Department of Matematical Sciences Clemson University Marc 8, 2017 Outline 1 Function Composition and Continuity 2 Function
More informationMath 312 Lecture Notes Modeling
Mat 3 Lecture Notes Modeling Warren Weckesser Department of Matematics Colgate University 5 7 January 006 Classifying Matematical Models An Example We consider te following scenario. During a storm, a
More information