arxiv: v1 [physics.comp-ph] 21 Feb 2018


rspa.royalsocietypublishing.org
Research Article submitted to journal

Subject Areas: Mechanical Engineering

Keywords: Data-driven forecasting, Long-Short Term Memory, Gaussian Processes, T21 barotropic climate model, Lorenz 96

Author for correspondence: Petros Koumoutsakos, petros@ethz.ch

Data-Driven Forecasting of High-Dimensional Chaotic Systems with Long-Short Term Memory Networks

Pantelis R. Vlachas 1, Wonmin Byeon 1, Zhong Y. Wan 2, Themistoklis P. Sapsis 2, Petros Koumoutsakos 1

1 Chair of Computational Science, ETH Zurich, Clausiusstrasse 33, Zurich, CH-8092, Switzerland
2 Department of Mechanical Engineering, Massachusetts Institute of Technology, 77 Massachusetts Ave., Cambridge, MA 02139, United States

We introduce a data-driven forecasting method for high-dimensional, chaotic systems using Long-Short Term Memory (LSTM) recurrent neural networks. The proposed LSTM neural networks perform inference of high-dimensional dynamical systems in their reduced order space and are shown to be an effective set of non-linear approximators of their attractor. We demonstrate the forecasting performance of the LSTM and compare it with Gaussian processes (GPs) in time series obtained from the Lorenz 96 system, the Kuramoto-Sivashinsky equation and a prototype climate model. The LSTM networks outperform the GPs in short-term forecasting accuracy in all applications considered. A hybrid architecture, extending the LSTM with a mean stochastic model (MSM-LSTM), is proposed to ensure convergence to the invariant measure. This novel hybrid method is fully data-driven and extends the forecasting capabilities of LSTM networks.

© The Authors. Published by the Royal Society under the terms of the Creative Commons Attribution License 4.0, which permits unrestricted use, provided the original author and source are credited.

1. Introduction

Natural systems, ranging from atmospheric climate and ocean circulation to organisms and cells, involve complex dynamics extending over multiple spatio-temporal scales. Centuries-old efforts to comprehend and forecast the dynamics of such systems have spurred developments in large-scale simulations, dimensionality reduction techniques and a multitude of forecasting methods. The goals of understanding and prediction have complemented each other but have been hindered by the high dimensionality and chaotic behavior of these systems. In recent years we observe a convergence of these approaches due to advances in computing power, algorithmic innovations and the ample availability of data. Major beneficiaries of this convergence are data-driven dimensionality reduction methods [2-7], model identification procedures [9-13] and forecasting techniques [14-19] that aim to provide precise short-term predictions while capturing the long-term statistics of these systems. Successful forecasting methods address the highly nonlinear energy transfer mechanisms between modes that are not captured effectively by the dimensionality reduction methods.

The pioneering technique of analog forecasting proposed in [20] inspired widespread research in non-parametric prediction approaches. Two dynamical system states are called analogues if they resemble one another on the basis of a specific criterion. This class of methods uses a training set of historical observations of the system. The system evolution is predicted using the evolution of the closest analogue from the training set, corrected by an error term. This approach has led to promising results in practice [21], but the selection of the resemblance criterion to pick the optimal analogue is far from straightforward. Moreover, the geometrical association between the current state and the training set is not exploited. More recently [22], analog forecasting is performed using a weighted combination of data points based on a localized kernel that quantifies the similarity of the new point and the weighted combination.
This technique exploits the local geometry instead of selecting a single optimal analogue. Similar kernel-based methods [23,24] use diffusion maps to globally parametrize a low-dimensional manifold capturing the slower time scales. Moreover, non-trivial interpolation schemes are investigated in order to encode the system dynamics in this reduced order space as well as map them to the full space (lifting). Although the geometrical structure of the data is taken into account, the solution of an eigen-system with a size proportional to the training data is required, rendering the approach computationally expensive. In addition, the inherent uncertainty due to sparse observations in certain regions of the attractor introduces prediction errors which cannot be modeled in a deterministic context. In [25] a method based on Gaussian process regression (GPR) [26] was proposed for prediction and uncertainty quantification in the reduced order space. The technique is based on a training set that sparsely samples the attractor. Stochastic predictions exploit the geometrical relationship between the current state and the training set, assuming a Gaussian prior over the modeled latent variables. A key advantage of GPR is that uncertainty bounds can be analytically derived from the hyper-parameters of the framework. Moreover, in [25] a Mean Stochastic Model (MSM) is used for under-sampled regions of the attractor to ensure accurate modeling of the steady state in the long-term regime. However, the resulting inference and training have a quadratic cost $O(N^2)$ in terms of the number of data samples $N$.

Some of the earlier approaches to capture the evolution of time series in chaotic systems using recurrent neural networks were developed during the inception of the Long-Short Term Memory networks (LSTM) [27]. However, to the best of our knowledge, these methods have been used only on low-dimensional chaotic systems [34]. Similarly, other machine learning algorithms such as Echo State Networks [36,37] and radial basis functions [38,39] have been successful, albeit only for low order dynamical systems.
In this work, we propose LSTM-based methods that exploit information of the recent history of the reduced order state to predict the high-dimensional dynamics. Time-series data are used to train the model while no knowledge of the underlying system equations is required. Inspired by Takens' theorem [40] an embedding space is constructed using time-delayed versions of the

reduced order variable. The proposed method tries to identify an approximate forecasting rule globally for the reduced order space. In contrast to GPR [25], the method has a deterministic output while its training cost scales linearly with the number of training samples and it exhibits an $O(1)$ inference computational cost. Moreover, following [25], LSTM is combined with a MSM, to cope with attractor regions that are not captured in the training set. In attractor regions under-represented in the training set, the MSM is used to guarantee convergence to the invariant measure and avoid an exponential growth of the prediction error. The effectiveness of the proposed hybrid method in accurate short-term prediction and in capturing the long-term behavior is shown in the Lorenz 96 system and the Kuramoto-Sivashinsky system. Finally, the method is also tested on predictions of a prototypical climate model.

The structure of the paper is as follows: In Section 2 we explain how the LSTM can be employed for modeling and prediction of a reference dynamical system and a blended LSTM-MSM technique is introduced. In Section 3 three other state-of-the-art methods, GPR, MSM and the hybrid GPR-MSM scheme, are presented and two comparison metrics are defined. The proposed LSTM technique and its LSTM-MSM extension are benchmarked in three complex chaotic systems in Section 4. In Section 5 we discuss the computational complexity of training and inference in LSTM. Finally, Section 6 offers a summary and discusses future research directions.

2. Long-Short Term Memory (LSTM) Recurrent Neural Networks

The LSTM was introduced in order to regularize the training of recurrent neural networks (RNNs) [27].
RNNs contain loops that allow information to be passed between consecutive temporal steps (see Figure 1) and can be expressed as:

$$h_t = \sigma_h\left(W_{hi}\, i_t + W_{hh}\, h_{t-1} + b_h\right), \tag{2.1}$$

$$o_t = \sigma_o\left(W_{oh}\, h_t + b_o\right), \tag{2.2}$$

where $i_t$, $o_t$ and $h_t$ are the input, the output and the hidden state of the RNN at time step $t$, while $D$ represents a delay block and $W_{hi}$, $W_{hh}$, $W_{oh}$ are the input-to-hidden, hidden-to-hidden and hidden-to-output weight matrices. Moreover, $\sigma_h$ and $\sigma_o$ are the hidden and output activation functions, while $b_h$ and $b_o$ are the respective biases. Temporal dependencies are captured by the hidden-to-hidden weight matrix $W_{hh}$, which couples two consecutive hidden states together. The RNN can be viewed in its unfolded form in Figure 2.

Figure 1: RNN

Figure 2: RNN unfolded in time

In many practical applications, RNNs suffer from the vanishing (or exploding) gradient problem and have failed to capture long-term dependencies [41,42]. Today the RNNs owe their renaissance largely to the LSTM, which copes effectively with the aforementioned problem using gates. The LSTM has been successfully applied in sequence modeling [32], speech recognition [28-30], hand-writing recognition [31] and language translation [33].
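The recurrence in (2.1) and (2.2) can be sketched in a few lines of NumPy. This is a minimal illustration, not the authors' implementation: the choice of `tanh` for $\sigma_h$, an identity $\sigma_o$, the layer sizes and the random weights are all assumptions made only for the example.

```python
import numpy as np

def rnn_step(i_t, h_prev, W_hi, W_hh, W_oh, b_h, b_o):
    """One step of (2.1)-(2.2); tanh hidden and identity output activations are assumed."""
    h_t = np.tanh(W_hi @ i_t + W_hh @ h_prev + b_h)  # (2.1): W_hh couples h_{t-1} and h_t
    o_t = W_oh @ h_t + b_o                           # (2.2) with sigma_o = identity
    return h_t, o_t

def rnn_unroll(inputs, h0, params):
    """Unfold the loop in time (as in Figure 2): reuse the same weights at every step."""
    h, outputs = h0, []
    for i_t in inputs:
        h, o_t = rnn_step(i_t, h, *params)
        outputs.append(o_t)
    return np.stack(outputs), h

# Hypothetical sizes and random weights, for illustration only.
rng = np.random.default_rng(0)
n_in, n_hid, n_out, n_steps = 3, 5, 2, 4
params = (
    rng.normal(size=(n_hid, n_in)),   # W_hi
    rng.normal(size=(n_hid, n_hid)),  # W_hh
    rng.normal(size=(n_out, n_hid)),  # W_oh
    np.zeros(n_hid),                  # b_h
    np.zeros(n_out),                  # b_o
)
outputs, h_T = rnn_unroll(rng.normal(size=(n_steps, n_in)), np.zeros(n_hid), params)
```

Because the same `W_hh` multiplies the hidden state at every step, gradients through the unrolled loop involve repeated products with it, which is where the vanishing or exploding behavior mentioned above comes from.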

The equations of the LSTM are

$$g^f_t = \sigma_f\left(W_f [h_{t-1}, i_t] + b_f\right), \tag{2.3}$$

$$g^i_t = \sigma_i\left(W_i [h_{t-1}, i_t] + b_i\right), \tag{2.4}$$

$$\tilde{C}_t = \tanh\left(W_C [h_{t-1}, i_t] + b_C\right), \tag{2.5}$$

$$C_t = g^f_t\, C_{t-1} + g^i_t\, \tilde{C}_t, \tag{2.6}$$

$$g^o_t = \sigma_h\left(W_h [h_{t-1}, i_t] + b_h\right), \tag{2.7}$$

$$h_t = g^o_t \tanh(C_t), \tag{2.8}$$

where $g^f_t$, $g^i_t$ and $g^o_t$ are the gate signals (forget, input and output gates), $i_t$ is the input, $h_t$ is the hidden state, $C_t$ is the cell state, while $W_f$, $b_f$, $W_i$, $b_i$, $W_C$, $b_C$, $W_h$ and $b_h$ are weight matrices and biases of appropriate dimensions. The activation functions $\sigma_f$, $\sigma_i$ and $\sigma_h$ are sigmoids. For a more detailed explanation of the LSTM architecture refer to [27]. The hidden state $h_t \in \mathbb{R}^h$, with $h$ the number of hidden units. In practice we want the output to have a specific dimension $d_o$. For this reason, a trivial fully connected final layer without activation function is added

$$o_t = W_{oh}\, h_t, \tag{2.9}$$

with $W_{oh} \in \mathbb{R}^{d_o \times h}$. In the following we refer to the LSTM hidden and cell states ($h_t$ and $C_t$) jointly as LSTM states.

In this work, we consider the reduced order problem where the system state is projected in the reduced order space. Moreover, the system is considered to be autonomous, while $\dot{z}_t = dz_t/dt$ is the system state derivative at time step $t$. The LSTM model is trained using time series data from the system to predict the state derivative $o_t \,\hat{=}\, \dot{z}_t = dz_t/dt$ at time $t$, using delayed versions of the reference reduced model state $z_t$. It is a solely data-driven approach and no explicit information regarding the form of the underlying equations is required.

(a) Training and inference

The available time series data are divided into two separate sets, the training dataset and the validation dataset, i.e. $z_t^{train}, \dot{z}_t^{train}$, $t \in \{1, \dots, N_{train}\}$, and $z_t^{val}, \dot{z}_t^{val}$, $t \in \{1, \dots, N_{val}\}$. $N_{train}$ and $N_{val}$ are the number of training and validation samples respectively. This data is stacked in batches as

$$\underbrace{I_t^{train} = \begin{pmatrix} z^{train}_{t+d-1} \\ z^{train}_{t+d-2} \\ \vdots \\ z^{train}_{t} \end{pmatrix}}_{\text{Input batch}}, \qquad \underbrace{o_t^{train} = \dot{z}^{train}_{t+d-1}}_{\text{Output batch}}, \tag{2.10}$$

for $t \in \{1, 2, \dots, N_{train} - d + 1\}$, in order to form the training (and validation) input and output of the LSTM. These training batches are used to optimize the parameters of the LSTM (weights and biases) in order to learn the mapping $I_t \to o_t$. The training proceeds by optimising the network weights iteratively for each batch (training of one epoch). The training loss function is a weighted version of the root mean square error, i.e.

$$\text{loss} = \sqrt{\frac{1}{d_o} \sum_{i=1}^{d_o} w_i \left(o_t^{train,\,i} - o_t^{i}\right)^2},$$

where $d_o$ is the dimension of the output of the LSTM, and the weights $w_i$ are selected according to the significance of each output component, e.g. the energy of each component. Moreover, the LSTM is trained using truncated Back-propagation Through Time (BPTT) [35]. The BPTT is truncated after layer $d$. As a consequence, the LSTM is trained to predict the derivative at time $t$ using information from the previous $d$ time steps.

An important issue is how to select the hidden state dimension $h$ and how to initialize the LSTM states at the truncation layer $d$. A small $h$ reduces the expressive capabilities of the LSTM

and deteriorates inference performance. On the other hand, a big $h$ leads to fast overfitting, an upturn in the generalization error and increased computational cost of training. For this reason, $h$ has to be tuned depending on the observed data (training and validation). For the truncation layer $d$, there are two alternatives, namely stateless and stateful LSTM. In stateless LSTM the LSTM states at layer $d$ are initialized to zero. As a consequence, the LSTM can only capture dependencies up to $d$ previous time steps. In the second variant, the stateful LSTM, the state is always propagated for $p$ time steps in the future and then reinitialized to zero, to help the LSTM capture longer dependencies. In this work, the systems considered exhibit chaotic behavior and the dependencies are inherently short-term, as the states in two time steps that differ significantly can be considered statistically independent. For this reason, the short temporal dependencies can be captured without propagating the hidden state for a long horizon. As a consequence, we consider only the stateless variant $p = 0$. We also applied stateful LSTM without any significant improvement, so we omit the results for brevity.

Optimization during training is performed using the Adam stochastic optimization method [10] with an adaptive learning rate (initial learning rate $\eta$ = .1). Training is stopped when convergence of the training error is detected or the maximum of 1 epochs is reached. The LSTM model with the smallest validation error is selected to avoid over-fitting.

The trained LSTM model can be used to forecast the system state in the next time steps in an iterative fashion. The history of the system up to time step $d$, i.e. $z_1^{true}, \dots, z_d^{true}$, is assumed to be known. We initialize the LSTM states with $h_0$ and $C_0$ and we use the trained LSTM to predict the derivative $\dot{z}_d^{pred}$. By integrating the derivative with a reference time difference $dt$ and initial condition $z_d^{true}$ the value $z_{d+1}^{pred}$ is obtained.
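The gate equations (2.3)-(2.9) and the stateless iterative forecasting loop can be sketched as follows. This is a minimal NumPy sketch under stated assumptions: the weights are random rather than trained, the helper names (`init_params`, `predict_derivative`, `iterative_forecast`) are hypothetical, the window is fed oldest state first, and a forward Euler step stands in for the integration scheme, which the text does not specify.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_step(i_t, h_prev, C_prev, p):
    """One LSTM step, equations (2.3)-(2.8); gate/state products are element-wise."""
    x = np.concatenate([h_prev, i_t])           # [h_{t-1}, i_t]
    g_f = sigmoid(p["W_f"] @ x + p["b_f"])      # forget gate (2.3)
    g_i = sigmoid(p["W_i"] @ x + p["b_i"])      # input gate (2.4)
    C_tilde = np.tanh(p["W_C"] @ x + p["b_C"])  # candidate cell state (2.5)
    C_t = g_f * C_prev + g_i * C_tilde          # cell state update (2.6)
    g_o = sigmoid(p["W_h"] @ x + p["b_h"])      # output gate (2.7)
    h_t = g_o * np.tanh(C_t)                    # hidden state (2.8)
    return h_t, C_t

def predict_derivative(window, p):
    """Stateless variant: zero initial states, d delayed inputs, linear read-out (2.9)."""
    n_hid = p["b_f"].size
    h, C = np.zeros(n_hid), np.zeros(n_hid)
    for z in window:                            # oldest state first (assumed ordering)
        h, C = lstm_step(z, h, C, p)
    return p["W_oh"] @ h

def iterative_forecast(history, p, dt, n_steps):
    """Feed predictions back as inputs; forward Euler integration is an assumption."""
    window = [np.asarray(z, dtype=float) for z in history]
    preds = []
    for _ in range(n_steps):
        z_dot = predict_derivative(window, p)
        z_next = window[-1] + dt * z_dot
        preds.append(z_next)
        window = window[1:] + [z_next]          # slide the delay window by one step
    return np.stack(preds)

def init_params(n_in, n_hid, rng):
    """Random (untrained) parameters, only to make the sketch executable."""
    p = {f"W_{k}": 0.1 * rng.normal(size=(n_hid, n_hid + n_in)) for k in "fiCh"}
    p.update({f"b_{k}": np.zeros(n_hid) for k in "fiCh"})
    p["W_oh"] = 0.1 * rng.normal(size=(n_in, n_hid))
    return p

rng = np.random.default_rng(0)
p = init_params(n_in=4, n_hid=8, rng=rng)
history = rng.normal(size=(5, 4))               # d = 5 known states of dimension 4
forecast = iterative_forecast(history, p, dt=0.01, n_steps=10)
```

Resetting `h` and `C` to zero inside `predict_derivative` is what makes the sketch stateless: each prediction depends only on the last `d` states in the window.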
The predicted value $z_{d+1}^{pred}$ is used for the next prediction in an iterative fashion, as illustrated in Figure 3. In stateful LSTM, initial values for $h_0$ and $C_0$ can be obtained by teacher forcing the LSTM for a few time steps, propagating values from the known history and ignoring the outputs. In stateless LSTM, $h_0$ and $C_0$ are initialized with zero vectors.

Figure 3: Iterative prediction using LSTM

(b) Mean Stochastic Model (MSM) and Hybrid LSTM-MSM

The MSM is a powerful data-driven method used to quantify uncertainty and perform forecasts in turbulent systems with high intrinsic attractor dimensionality [25,43]. It is parametrized a priori to capture global statistical information of the attractor by design, while its computational complexity is very low compared to LSTM or GPR. The concept behind MSM is to model each component of the state $z^i$ independently with an Ornstein-Uhlenbeck (OU) process that captures the energy spectrum and the damping time scales of the statistical equilibrium. The process takes the following form

$$dz^i = -c_i\, z^i\, dt + \xi_i\, dW_i, \tag{2.11}$$

where $c_i$, $\xi_i$ are parameters fitted to the centered training data and $W_i$ is a Wiener process. In the statistical steady state the mean, energy and damping time scale of the process are given by

$$\mu_i = E[z^i] = 0, \qquad E_i = E[z^i (z^i)^*] = \frac{\xi_i^2}{2 c_i}, \qquad T_i = \frac{1}{c_i}. \tag{2.12}$$

In order to fit the model parameters $c_i$, $\xi_i$ we directly estimate the variance $E[z^i (z^i)^*]$ from the time series training data and the decorrelation time using

$$T_i = \frac{1}{E[z^i (z^i)^*]} \int_0^{\infty} E[z^i(t)\,(z^i)^*(t+\tau)]\, d\tau. \tag{2.13}$$

After computing these two quantities we substitute them in (2.12) and solve with respect to $c_i$ and $\xi_i$. Since the MSM is modelled a priori to mimic the global statistical behavior of the attractor, forecasts made with MSM can never escape the attractor. This is not the case with LSTM and GPR, as prediction errors accumulate and iterative forecasts escape the attractor fast due to the chaotic dynamics, although short-term predictions are accurate. This problem has been addressed with respect to GPR in [25]. In order to cope effectively with this problem we introduce a hybrid LSTM-MSM technique that prevents forecasts from diverging from the attractor. The state-dependent decision rule for forecasting in LSTM-MSM is given by

$$\dot{z}_t = \begin{cases} (\dot{z}_t)_{LSTM}, & \text{if } p^{train}(z_t) = \prod_i p^{train}_i(z^i_t) > \delta \\ (\dot{z}_t)_{MSM}, & \text{otherwise} \end{cases} \tag{2.14}$$

where $p^{train}(z_t)$ is an approximation of the probability density function of the training dataset and $\delta \approx$ .1 a constant threshold tuned based on $p^{train}(z_t)$. We approximate $p^{train}(z_t)$ using a mixture of Gaussian kernels. This hybrid architecture exploits the advantages of LSTM and MSM. In case there is a high probability that the state $z^i$ lies close to the training dataset (interpolation), the LSTM, having memorized the local dynamics, is used to perform inference. This ensures accurate LSTM short-term predictions. On the other hand, close to the boundaries the attractor is only sparsely sampled, $p^{train}(z^i) < \delta$, and errors from LSTM predictions would lead to divergence.
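The fitting procedure of (2.12)-(2.13) and the decision rule (2.14) can be sketched as below. Everything specific to the sketch is an assumption: the function names are hypothetical, each component is treated as real-valued, the autocorrelation integral is cut at its first zero crossing to limit sampling noise, the kernel bandwidth follows Silverman's rule, and the two derivative callables are toy stand-ins for a trained LSTM and for the MSM.

```python
import numpy as np

def fit_msm(z, dt, max_lag):
    """Estimate OU parameters (c, xi) of one real component via (2.12)-(2.13)."""
    z = np.asarray(z, dtype=float)
    z = z - z.mean()                                  # centered training data
    energy = np.mean(z * z)                           # E[z z*]
    acf = np.array([np.mean(z[: z.size - k] * z[k:]) for k in range(max_lag + 1)]) / energy
    negative = np.flatnonzero(acf <= 0.0)
    if negative.size:                                 # assumption: cut at first zero crossing
        acf = acf[: negative[0]]
    T = dt * (acf.sum() - 0.5 * (acf[0] + acf[-1]))   # trapezoidal estimate of (2.13)
    c = 1.0 / T                                       # from T = 1/c
    xi = np.sqrt(2.0 * c * energy)                    # from E = xi^2 / (2c)
    return c, xi

def simulate_ou(c, xi, z0, dt, n_steps, rng):
    """Euler-Maruyama discretization of (2.11)."""
    z = np.empty(n_steps)
    z[0] = z0
    noise = xi * np.sqrt(dt) * rng.normal(size=n_steps - 1)
    for n in range(n_steps - 1):
        z[n + 1] = z[n] - c * z[n] * dt + noise[n]
    return z

def silverman_bandwidths(train):
    """Per-component kernel bandwidths (Silverman's rule, an assumption)."""
    return 1.06 * train.std(axis=0) * train.shape[0] ** (-0.2)

def p_train(z, train, bandwidths):
    """Product of per-component Gaussian kernel density estimates, as in (2.14)."""
    u = (z - train) / bandwidths                      # shape (n_samples, dim)
    kernels = np.exp(-0.5 * u * u) / (bandwidths * np.sqrt(2.0 * np.pi))
    return np.prod(kernels.mean(axis=0))

def hybrid_derivative(z, train, bandwidths, delta, lstm_dot, msm_dot):
    """Decision rule (2.14): LSTM where the training density is high, MSM otherwise."""
    return lstm_dot(z) if p_train(z, train, bandwidths) > delta else msm_dot(z)

rng = np.random.default_rng(1)
series = simulate_ou(c=2.0, xi=1.0, z0=0.0, dt=0.01, n_steps=200_000, rng=rng)
c_hat, xi_hat = fit_msm(series, dt=0.01, max_lag=500)

train = rng.normal(size=(500, 2))                     # hypothetical 2-D training set
bw = silverman_bandwidths(train)
lstm_dot = lambda z: np.ones_like(z)                  # stand-in for a trained LSTM
msm_dot = lambda z: -2.0 * z                          # OU drift only; noise term omitted
inside = hybrid_derivative(np.zeros(2), train, bw, 0.01, lstm_dot, msm_dot)
outside = hybrid_derivative(np.array([10.0, 10.0]), train, bw, 0.01, lstm_dot, msm_dot)
```

The threshold value 0.01 used in the usage lines is illustrative only. In a stochastic forecast the MSM branch would also add the noise increment $\xi_i\, dW_i$; it is left out here so that the switching logic can be checked deterministically.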
In such sparsely sampled regions, MSM guarantees that forecasting trajectories remain close to the attractor, and that we converge to the statistical invariant measure in the long term.

3. Benchmark and Performance Measures

The performance of the proposed LSTM-based prediction mechanism is benchmarked against the following state-of-the-art methods:

- Mean Stochastic Model (MSM)
- Gaussian Process Regression (GPR)
- Mixed Model (GPR-MSM)

In order to guarantee that the prediction performance is independent of the initial condition selected, for all applications and all performance measures considered we report the average value of each measure over a number of different initial conditions sampled independently and uniformly from the attractor. The ground truth trajectory is obtained by integrating the discretized reference equation starting from each initial condition, and projecting the states to the reduced order space. The reference equation and the projection method are of course application dependent.

From each initial condition, we generate an empirical Gaussian ensemble of dimension $N_{en}$ around the initial condition with a small variance $\sigma_{en}$. This noise represents the uncertainty in the knowledge of the initial system state. We forecast the evolution of the ensemble by iteratively predicting the derivatives and integrating (deterministically for each ensemble member for the LSTM, stochastically for GPR) and we keep track of the mean. The ensemble size $N_{en}$ is

selected in the order of 5, which is the usual choice in environmental science, e.g. weather prediction and short-term climate prediction [44].

The ground truth trajectory $z$ at each time instant is then compared with the predicted ensemble mean $\tilde{z}$. As a comparison measure we use the root mean square error (RMSE) defined as

$$\text{RMSE}(z_k) = \sqrt{\frac{1}{V} \sum_{i=1}^{V} \left(z_k^i - \tilde{z}_k^i\right)^2},$$

where index $k$ denotes the $k$-th component of the reduced order state $z$, $i$ is the initial condition, and $V$ is the total number of initial conditions. The RMSE is computed at each time instant for each component $k$ of the reduced order state, resulting in error curves that describe the evolution of the error with time. Moreover, we use the mean Anomaly Correlation (AC) [47] over $V$ initial conditions to quantify the pattern correlation of the predicted trajectories with the ground truth. The AC is defined as

$$\text{AC} = \frac{1}{V} \sum_{i=1}^{V} \frac{\sum_{k=1}^{r_{dim}} w_k \left(z_k^i - \bar{z}_k\right)\left(\tilde{z}_k^i - \bar{z}_k\right)}{\sqrt{\sum_{k=1}^{r_{dim}} w_k \left(z_k^i - \bar{z}_k\right)^2}\, \sqrt{\sum_{k=1}^{r_{dim}} w_k \left(\tilde{z}_k^i - \bar{z}_k\right)^2}}, \tag{3.1}$$

where $k$ refers to the mode number, $i$ refers to the initial condition, $w_k$ are mode weights selected according to the energies of the modes after dimensionality reduction and $\bar{z}_k$ is the time average of the respective mode, considered as reference. This score ranges from $-1.0$ to $1.0$. If the forecast is perfect, the score equals $1.0$. The AC coefficient is a widely used forecasting accuracy score in the meteorological community [46].

4. Applications

In this section, the effectiveness of the proposed method is demonstrated with respect to three chaotic dynamical systems, exhibiting different levels of chaos, from weakly chaotic to fully turbulent, i.e. the Lorenz 96 system, the Kuramoto-Sivashinsky equation and a prototypical barotropic climate model.

(a) The Lorenz 96 System

In [45] a model of the large-scale behaviour of the mid-latitude atmosphere is introduced. This model describes the time evolution of the components $X_j$ for $j \in \{0, 1, \dots, J-1\}$ of a spatially discretized (over a single latitude circle) atmospheric variable.
In the following we refer to this model as the Lorenz 96. The Lorenz 96 is usually used ([25,46] and references therein) as a toy problem to benchmark methods for weather prediction. The system of differential equations that governs the Lorenz 96 is defined as

$$\frac{dX_j}{dt} = (X_{j+1} - X_{j-2})\, X_{j-1} - X_j + F, \tag{4.1}$$

for $j \in \{0, 1, \dots, J-1\}$, where by definition $X_{-1} = X_{J}$, $X_{-2} = X_{J-1}$. In our analysis $J = 40$. The right-hand side of (4.1) consists of a non-linear advective term $(X_{j+1} - X_{j-2})\, X_{j-1}$, a linear advection (dissipative) term $-X_j$ and a positive external forcing term $F$. The discrete energy of the system remains constant throughout time and the Lorenz 96 states $X_j$ remain bounded. By increasing the external forcing parameter $F$ the behavior that the system exhibits changes from periodic ($F < 1$) to weakly chaotic ($F = 4$) to end up in fully turbulent regimes ($F = 16$). We refer to $X_j$ as the states of the Lorenz 96 model. These regimes can be observed in Figure 4.

Following [25,44] we apply a shifting and scaling to standardize the Lorenz 96 states $X_j$. The discrete or Dirichlet energy is given by $E = \frac{1}{2} \sum_{j=1}^{J} X_j^2$. In order for the scaled Lorenz 96 states to have zero mean and unit energy we transform them using

$$\tilde{X}_j = \frac{X_j - \bar{X}}{\sqrt{E_p}}, \qquad d\tilde{t} = \sqrt{E_p}\, dt, \tag{4.2}$$

where $E_p$ is the average energy fluctuation, i.e.

$$E_p = \frac{1}{2T} \sum_{j=0}^{J-1} \int_{T_0}^{T_0+T} (X_j - \bar{X})^2\, dt. \tag{4.3}$$

In this way the scaled energy is $\tilde{E} = \frac{1}{2} \sum_{j=0}^{J-1} \tilde{X}_j^2 = 1$ and the scaled variables have zero mean $\bar{\tilde{X}} = \frac{1}{J} \sum_{j=0}^{J-1} \tilde{X}_j = 0$, with $\bar{X}$ the mean state. The scaled Lorenz 96 states $\tilde{X}_j$ obey the following differential equation

$$\frac{d\tilde{X}_j}{d\tilde{t}} = \frac{F - \bar{X}}{E_p} + \frac{(\tilde{X}_{j+1} - \tilde{X}_{j-2})\, \bar{X} - \tilde{X}_j}{\sqrt{E_p}} + (\tilde{X}_{j+1} - \tilde{X}_{j-2})\, \tilde{X}_{j-1}. \tag{4.4}$$

Figure 4: Lorenz 96 contour plots (time $t$ versus space $x$) for the forcing regimes F = 4, F = 8 and F = 16. Chaoticity rises with bigger values of F.

(i) Dimensionality Reduction: Discrete Fourier Transform

Firstly, the Discrete Fourier Transform (DFT) is applied to the energy-standardized Lorenz 96 states $\tilde{X}_j$. The Fourier coefficients $\hat{X}_k \in \mathbb{C}$ are given by

$$\hat{X}_k = \frac{1}{J} \sum_{j=0}^{J-1} \tilde{X}_j\, e^{-2\pi i k j / J}, \tag{4.5}$$

while the Lorenz 96 states can be recovered from the Fourier coefficients using the inverse DFT

$$\tilde{X}_j = \sum_{k=0}^{J-1} \hat{X}_k\, e^{2\pi i k j / J}. \tag{4.6}$$

After applying the DFT to the Lorenz 96 states we end up with a symmetric energy spectrum that can be uniquely characterized by $J/2 + 1$ ($J$ is considered to be an even number) coefficients $\hat{X}_k$ for $k \in K = \{0, 1, \dots, J/2\}$. In our case $J = 40$, thus we end up with $|K| = 21$ complex coefficients $\hat{X}_k \in \mathbb{C}$. These coefficients are referred to as the Fourier modes or simply modes. The Fourier energy of each mode is defined as

$$E_k = \mathrm{Var}(\hat{X}_k) = E\left[\left(\hat{X}_k(\tilde{t}) - \bar{\hat{X}}_k\right)\left(\hat{X}_k(\tilde{t}) - \bar{\hat{X}}_k\right)^*\right]. \tag{4.7}$$

The energy spectrum of the Lorenz 96 system is plotted in Figure 5 for different values of the forcing term $F$. We take into account only the $r_{dim} = 6$ modes corresponding to the highest

energies and the rest of the modes are truncated. For the different forcing regimes F = 1, 2, 3, 4, the six most energetic modes correspond to approximately 89%, 57.8%, 52% and 43.8% of the total energy respectively. The space where the reduced variables live is referred to as the reduced order phase space and the most energetic modes are denoted as $\hat{X}^r_k$ for $k \in \{0, 1, \dots, r_{dim} - 1\}$. As shown in [48] the most energetic modes are not necessarily the ones that best capture the dynamics of the model. However, in this work we are not interested in an optimal reduced space representation, but rather in the effectiveness of a prediction model given this space. The respective wavenumbers of the most energetic modes as well as their energy are given in Table 1. The truncated modes are ignored for now. Nevertheless, their effect can be modelled stochastically as in [25].

Figure 5: Energy spectrum $E_k$ (versus wavenumber $k$) and cumulative energy in % (versus the number of most energetic modes used) for different forcing regimes of the Lorenz 96 system. As the forcing increases, more chaoticity is introduced to the system. Curves: F = 4; F = 8; F = 16.

Table 1: Most energetic Fourier modes used in the reduced order phase space

Forcing    Wavenumbers k
F = 4      7, 10, 14, 9, 17, 16
F = 6      8, 7, 9, 10, 11, 6
F = 8      8, 9, 7, 10, 11, 6
F = 16     8, 9, 10, 7, 11, 6

Since each Fourier mode $\hat{X}^r_k$ is a complex number, it consists of a real part and an imaginary part. By stacking these real and imaginary parts of the $r_{dim}$ truncated modes we end up with the $2\, r_{dim}$ dimensional reduced model state

$$X \equiv \left[\mathrm{Re}(\hat{X}^r_1), \dots, \mathrm{Re}(\hat{X}^r_{r_{dim}}), \mathrm{Im}(\hat{X}^r_1), \dots, \mathrm{Im}(\hat{X}^r_{r_{dim}})\right]^T. \tag{4.8}$$

Assuming that $X^t_j$ for $j \in \{0, 1, \dots, J-1\}$ are the Lorenz 96 states at time instant $t$, the mapping $X^t_j, \forall j \to X$ is unique and the reduced model state of the Lorenz 96 has a specific vector value. For high dimensions, the Fourier Transform is equivalent to Principal Component Analysis.
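The pipeline from the Lorenz 96 equations (4.1) to the reduced state (4.8) can be sketched as follows. This is a minimal sketch under stated assumptions: periodic indexing is implemented with `np.roll`, the time step, trajectory length and number of discarded transient steps are illustrative values rather than the settings of the paper, and the function names are hypothetical.

```python
import numpy as np

def lorenz96_rhs(X, F):
    """Right-hand side of (4.1) with periodic indexing via np.roll."""
    return (np.roll(X, -1) - np.roll(X, 2)) * np.roll(X, 1) - X + F

def rk4_step(X, F, dt):
    """One classical 4th-order Runge-Kutta step."""
    k1 = lorenz96_rhs(X, F)
    k2 = lorenz96_rhs(X + 0.5 * dt * k1, F)
    k3 = lorenz96_rhs(X + 0.5 * dt * k2, F)
    k4 = lorenz96_rhs(X + dt * k3, F)
    return X + dt / 6.0 * (k1 + 2.0 * k2 + 2.0 * k3 + k4)

def integrate(X0, F, dt, n_steps, n_discard):
    """Integrate and drop the first n_discard steps as transients."""
    X = np.array(X0, dtype=float)
    traj = []
    for n in range(n_steps):
        X = rk4_step(X, F, dt)
        if n >= n_discard:
            traj.append(X.copy())
    return np.stack(traj)                            # shape (time, J)

def standardize(traj):
    """Shift and scale as in (4.2), with E_p estimated as a time average of (4.3)."""
    X_bar = traj.mean()
    E_p = 0.5 * np.sum(np.mean((traj - X_bar) ** 2, axis=0))
    return (traj - X_bar) / np.sqrt(E_p), X_bar, E_p

def reduce_state(X_tilde, r_dim):
    """DFT (4.5), mode energies (4.7), and stacked Re/Im parts of the top modes (4.8)."""
    J = X_tilde.shape[1]
    X_hat = np.fft.fft(X_tilde, axis=1)[:, : J // 2 + 1] / J   # 1/J matches (4.5)
    energy = np.var(X_hat, axis=0)                   # variance of complex modes is real
    modes = np.argsort(energy)[::-1][:r_dim]         # most energetic wavenumbers first
    z = np.concatenate([X_hat[:, modes].real, X_hat[:, modes].imag], axis=1)
    return z, modes, energy

rng = np.random.default_rng(0)
J, F = 40, 8.0
X0 = F + 0.01 * rng.normal(size=J)                   # small perturbation of X_j = F
traj = integrate(X0, F, dt=0.01, n_steps=3000, n_discard=1000)
X_tilde, X_bar, E_p = standardize(traj)
z, modes, energy = reduce_state(X_tilde, r_dim=6)
```

`np.fft.fft` uses the same sign convention as (4.5) but no normalization, hence the explicit division by `J`. The time rescaling $d\tilde{t} = \sqrt{E_p}\, dt$ from (4.2) is not applied here, since the sketch only needs the standardized states.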
(ii) Training and Prediction in Lorenz 96

The reduced Lorenz 96 system states $X^t$ are considered as the true reference states $z_t$. The LSTM is trained to forecast the derivative of the reduced order state $dz_t/dt$ as in [34]. In the following we analyze the influence of the truncation layer $d$ and the number of hidden units $h$ of the LSTM with respect to the chaotic Lorenz 96 system.

The influence of $d$ on the training and performance of the LSTM model is the following. On the one hand, selecting a large $d$ makes the training more challenging, for two reasons. Firstly, the LSTM has more layers and secondly more noise might be included in the input (irrelevant information)

rendering suboptimal prediction performance. On the other hand, selecting a small $d$ might lead to an input sequence with poor information content, leading to low prediction performance. Increasing the number of hidden nodes $h$ raises the expressiveness of the LSTM, but makes it easier to overfit the training set. A stateless LSTM is used. The back-propagation truncation horizon is set to $d$ = 1 and we use $h$ = 2.

In order to obtain training data for the LSTM, we integrate the Lorenz 96 system state Eq. (4.1) starting from an initial condition $X^0_j$ for $j \in \{0, 1, \dots, J-1\}$ using a Runge-Kutta 4th order method with a time step $dt$ = .1 up to $T$ = 51. In this way a time series $X^t_j$, $t \in \{0, 1, \dots\}$ is constructed. Using the scaling and dimensionality reduction method explained in Section (i) we construct the reduced order state time series $X^t$, $t \in \{0, 1, \dots\}$, using the mapping $X^t_j, \forall j \to X^t$. From this time series we discard the first $10^4$ initial time steps to avoid transients, ending up with a time series with $N_{train}$ = 5 samples. A similar but independent process is repeated for the validation set.

(iii) Results

The trained LSTM models are used for prediction based on the iterative procedure explained in Section 2. In this section, we demonstrate the forecasting capabilities of LSTM and compare it with the state of the art. 1 different initial conditions are simulated. For each initial condition, an ensemble with size $N_{en}$ = 5 is considered by perturbing it with normal noise with variance $\sigma_{en}$ = .1. In Figures 6a, 6b, and 6c we report the mean RMSE prediction error of the most energetic mode $\hat{X}^r_1 \in \mathbb{C}$, scaled with $\sqrt{E_p}$, for the forcing regimes $F \in \{6, 8, 16\}$ for the first $N$ = 1 time steps ($T$ = .1). In the RMSE the complex norm $\|v\|_2^2 = v v^*$ is taken into account. The 1% of the standard deviation of the attractor is also plotted for reference (1%σ). As $F$ increases, the system becomes more chaotic and difficult to predict. As a consequence, the number of prediction steps that remain under the 1%σ threshold decreases.
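The evaluation protocol (a Gaussian ensemble around each initial condition, the ensemble-mean forecast, and the RMSE and AC scores of Section 3) can be sketched as follows. This is an illustrative sketch, not the authors' evaluation code: the function names, the toy derivative $\dot{z} = -z$, the forward Euler step, and all numerical values (ensemble size, noise variance, step count, weights) are assumptions chosen only to make the example run.

```python
import numpy as np

def ensemble_mean_forecast(z0, deriv, dt, n_steps, n_en, sigma_en, rng):
    """Perturb z0 with Gaussian noise of variance sigma_en, step every member
    forward with Euler (an assumption) and track the ensemble mean."""
    members = z0 + np.sqrt(sigma_en) * rng.normal(size=(n_en, z0.size))
    means = []
    for _ in range(n_steps):
        members = members + dt * np.apply_along_axis(deriv, 1, members)
        means.append(members.mean(axis=0))
    return np.stack(means)                           # shape (n_steps, r_dim)

def rmse(truth, pred):
    """RMSE over initial conditions; arrays have shape (V, n_steps, r_dim).
    np.abs makes the complex norm |v|^2 = v v* work for complex modes too."""
    return np.sqrt(np.mean(np.abs(truth - pred) ** 2, axis=0))

def anomaly_correlation(truth, pred, z_bar, w):
    """Mean AC (3.1) over V initial conditions for real arrays (V, n_steps, r_dim)."""
    a, b = truth - z_bar, pred - z_bar               # anomalies w.r.t. the time mean
    num = np.sum(w * a * b, axis=-1)
    den = np.sqrt(np.sum(w * a * a, axis=-1)) * np.sqrt(np.sum(w * b * b, axis=-1))
    return np.mean(num / den, axis=0)                # shape (n_steps,)

rng = np.random.default_rng(0)
z0 = np.array([1.0, -0.5, 0.2])
mean_traj = ensemble_mean_forecast(z0, lambda z: -z, dt=0.01, n_steps=20,
                                   n_en=50, sigma_en=1e-4, rng=rng)

truth = rng.normal(size=(5, 20, 3))                  # hypothetical V = 5 trajectories
z_bar = truth.mean(axis=(0, 1))                      # stand-in for the time average
w = np.array([0.5, 0.3, 0.2])                        # hypothetical mode weights
ac_perfect = anomaly_correlation(truth, truth, z_bar, w)
ac_mirrored = anomaly_correlation(truth, 2.0 * z_bar - truth, z_bar, w)
```

A forecast identical to the truth gives an AC of 1 at every time step, while a forecast whose anomalies are mirrored about the mean gives -1, matching the stated range of the score.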
The LSTM models extend the predictability horizon (the number of prediction steps below the 1%σ threshold) for all forcing regimes compared to GPR and MSM. However, when LSTM is combined with MSM the short-term prediction performance is compromised. Nevertheless, hybrid LSTM-MSM models outperform GPR methods in short-term prediction accuracy.

In Figures 6d, 6e, and 6f, the RMSE error for $T$ = 2 is plotted. The standard deviation from the attractor σ is plotted for reference. We can observe the following:

- The prediction performance of the LSTM in the quasi-periodic regime F = 4 is clearly superior to all other approaches. Blending LSTM with MSM guarantees accurate modeling of the steady state in the long term, but leads to a performance compromise in the short term. LSTM-MSM outperforms GPR-MSM.
- In all forcing regimes, both GPR and LSTM eventually diverge, while MSM and the blended GPR-MSM and LSTM-MSM schemes remain close to the attractor in the long term, as expected.
- For F = 8, although the RMSE error in the short term is smaller for LSTM, GPR remains close to the attractor for a longer period (e.g. $T$ = .75 for F = 8). However, when blended schemes are taken into account, LSTM-MSM shows superior performance in the short term and slightly better performance in the long term compared to GPR-MSM.

In Figures 6g, 6h, and 6i, the mean AC over 1 initial conditions is given. The predictability threshold of 0.6 is also plotted. After crossing this critical threshold, the methods do not predict better than a trivial mean predictor. For F = 4 GPR methods show inferior performance compared to LSTM approaches, as analyzed previously in the RMSE comparison. However, for F = 8 LSTM models do not predict better than the mean after $T \approx$ .35, while GPR shows better performance. In turn, when blended with MSM the compromise in the performance for GPR-MSM is much bigger compared to LSTM-MSM. The LSTM-MSM scheme shows slightly superior performance to GPR-MSM during the entire relevant time period (AC > 0.6). For the fully

turbulent regime F = 16, LSTM shows comparable performance with both GPR and MSM and all methods converge as chaoticity rises, since the intrinsic dimensionality of the system attractor increases and the system becomes inherently unpredictable.

Figure 6: Mean RMSE of the most energetic mode (panels (a)-(f), for F = 4 with wavenumber k = 7, F = 8 with wavenumber k = 8, and F = 16) and mean AC (panels (g)-(i), for F = 4, F = 8 and F = 16) over 1 initial conditions for the Lorenz 96 system. Legend: 1% of the standard deviation from the attractor; standard deviation from the attractor; AC predictability threshold; MSM; GPR; GPR-MSM; LSTM; LSTM-MSM.

In Figure 7, the evolution of the mean RMSE over 1 initial conditions of the wavenumbers k = 8, 9, 10, 11 of the Lorenz 96 with forcing F = 8 is plotted. In contrast to GPR, the RMSE error of LSTM is much lower in the moderate and low energy wavenumbers k = 9, 10, 11 compared to the most energetic mode k = 8. This difference among modes is not observed in GPR. This can be attributed to the highly non-linear energy transfer mechanisms between these lower energy modes as opposed to the Gaussian and locally linear energy transfers of the most energetic mode.

As illustrated before, the hybrid LSTM-MSM architecture effectively combines the accurate short-term prediction performance of LSTM with the long-term stability of MSM. The percentage of ensemble members in the hybrid scheme explained by LSTM is plotted with respect to time in Figure 8. In parallel with the GPR results presented in [25], the slope of the percentage drop increases with F up to time $t \approx 1.5$. However, in contrast to the results from GPR reported in [25], LSTM shows a more stable behavior as a bigger percentage of the ensembles is explained by it

compared to GPR in general. This is because LSTM is a local nonlinear attractor approximator and can better capture the mean local dynamics, while GPR is locally linear.

Figure 7: Mean RMSE of the most energetic mode (k = 8) and medium and low energy modes (k = 9, 10, 11) over 1 initial conditions for the Lorenz 96 system with forcing F = 8, one panel per wavenumber. Legend: 1% of the standard deviation from the attractor; MSM; GPR; GPR-MSM; LSTM; LSTM-MSM.

Figure 8: Average percentage over 5 initial conditions of the ensemble members evaluated using LSTM dynamics over time for different Lorenz 96 forcing regimes in the hybrid LSTM-MSM method (axes: LSTM dynamics % versus time). Curves: F = 4; F = 8; F = 16.

(b) Kuramoto-Sivashinsky Equation

The Kuramoto-Sivashinsky (K-S) system is extensively used in many scientific fields to model a multitude of chaotic physical phenomena. It was first derived by Kuramoto [49,50] as a turbulence model of the phase gradient of a slowly varying amplitude in a reaction-diffusion type medium with negative viscosity coefficient. Later, Sivashinsky [51] studied the spontaneous instabilities of the plane front of a laminar flame, ending up with the K-S equation, while in [52] the K-S equation is found to describe the surface behavior of viscous liquid in a vertical flow.

For our study, we restrict ourselves to the one dimensional K-S equation with boundary and initial conditions given by

$$
\begin{aligned}
\frac{\partial u}{\partial t} &= -\nu \frac{\partial^4 u}{\partial x^4} - \frac{\partial^2 u}{\partial x^2} - u \frac{\partial u}{\partial x}, \\
u(0, t) &= u(L, t) = \left.\frac{\partial u}{\partial x}\right|_{x=0} = \left.\frac{\partial u}{\partial x}\right|_{x=L} = 0, \\
u(x, 0) &= u_0(x),
\end{aligned}
\tag{4.9}
$$

where $u(x, t)$ is the modeled quantity of interest depending on a spatial variable $x \in [0, L]$ and time $t \in [0, \infty)$. The negative viscosity is modeled by the parameter $\nu > 0$. We impose Dirichlet and second-type boundary conditions to guarantee ergodicity [53]. In order to spatially discretize (4.9) we use a grid size $\Delta x$ with $D = L/\Delta x$ the number of nodes. Further, we denote with $u_i = u(i \Delta x)$ the value of $u$ at node $i \in \{0, \dots, D\}$. Discretization using a second order finite differences scheme yields

$$
\frac{du_i}{dt} = -\nu \frac{u_{i-2} - 4u_{i-1} + 6u_i - 4u_{i+1} + u_{i+2}}{\Delta x^4} - \frac{u_{i+1}^2 - u_{i-1}^2}{4 \Delta x} - \frac{u_{i+1} - 2u_i + u_{i-1}}{\Delta x^2}.
\tag{4.10}
$$

Further, we impose $u_0 = u_{D+1} = 0$ and add ghost nodes $u_{-1} = u_1$, $u_{D+2} = u_D$ to account for the Dirichlet and second-order boundary conditions. In our analysis, the number of nodes is D = 512. The Kuramoto-Sivashinsky equation exhibits different levels of chaos depending on the bifurcation parameter $\tilde{L} = L / (2\pi\sqrt{\nu})$ [54]. Higher values of $\tilde{L}$ lead to more chaotic systems [25].

Figure 9: Contour plots of u(x, t) for different values of ν in steady state. Chaoticity rises with smaller values of ν.

In our analysis the spatial variable bound is held constant to L = 16 and the chaoticity level is controlled through the negative viscosity ν, where a smaller value leads to a system with a

higher level of chaos (see Figure 9). The temporal average of the state and the cumulative energy are plotted in Figure 10. As ν declines, chaoticity in the system rises and higher oscillations of the mean towards the Dirichlet boundary conditions are observed, while the number of modes needed to capture most of the energy is higher. In our study, we consider two values, namely ν = 1/1 and ν = 1/16, to benchmark the prediction skills of the proposed method. The discretized equation (4.10) is integrated with a time interval dt = .2 up to T = 11. The data points up to T = 1 are discarded as initial transients. Half of the remaining data (N = 25 samples) are used for training and the other half for validation.

Figure 10: Temporal average ū and cumulative mode (PCA) energy for different values of ν (axes: ū versus x; cumulative energy in % versus number of modes used). Legend: 1/ν = 1; 1/ν = 16; 1/ν = 36.

(i) Dimensionality Reduction: Singular Value Decomposition

The dimensionality of the problem is reduced using Singular Value Decomposition (SVD). By subtracting the temporal mean ū and stacking the data, we end up with the data matrix $U \in \mathbb{R}^{N \times 513}$, where N is the number of data samples (N = 5 in our case). Performing SVD on U leads to

$$
U = M \Sigma V^T, \quad M \in \mathbb{R}^{N \times N}, \quad \Sigma \in \mathbb{R}^{N \times 513}, \quad V \in \mathbb{R}^{513 \times 513},
\tag{4.11}
$$

with Σ diagonal, with descending diagonal elements. The right singular vectors corresponding to the $r_{dim}$ largest singular values are the first columns of $V = [V_r, V_{-r}]$. Stacking these singular vectors yields $V_r \in \mathbb{R}^{513 \times r_{dim}}$. Assuming that $u_t \in \mathbb{R}^{513}$ is a vector of the discretized values of u(x, t) at time t, in order to get a reduced order representation corresponding to the components with the highest energies (singular values) we multiply

$$
c = V_r^T u, \quad c \in \mathbb{R}^{r_{dim}}.
\tag{4.12}
$$

Applying SVD on the data matrix U is equivalent to Principal Component Analysis on the covariance matrix, as in [25]. The percentage of cumulative energy with respect to the number of components (modes) considered is plotted in Figure 10. Further, the 90% threshold is plotted.
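The spatial discretization (4.10) and the SVD projection (4.11)-(4.12) can be sketched together as follows. This is a minimal NumPy sketch, not the authors' code: the function names `ks_rhs`, `svd_reduce` and `project` are hypothetical, the unknowns are assumed to be the interior values $u_1, \dots, u_D$, and the projection is assumed to act on the mean-subtracted state. Time integration is left out on purpose, since the integrator is not specified in this section.

```python
import numpy as np


def ks_rhs(u_inner, nu, dx):
    """Right-hand side of the discretized K-S equation (4.10).

    u_inner holds the interior values u_1, ..., u_D (an assumption of this
    sketch). The boundary values u_0 = u_{D+1} = 0 and the ghost nodes
    u_{-1} = u_1, u_{D+2} = u_D are appended as described in the text.
    """
    # Padded layout: [u_{-1}, u_0, u_1, ..., u_D, u_{D+1}, u_{D+2}]
    u = np.concatenate(([u_inner[0], 0.0], u_inner, [0.0, u_inner[-1]]))
    c = u[2:-2]                    # u_i
    m1, p1 = u[1:-3], u[3:-1]      # u_{i-1}, u_{i+1}
    m2, p2 = u[:-4], u[4:]         # u_{i-2}, u_{i+2}
    fourth = (m2 - 4 * m1 + 6 * c - 4 * p1 + p2) / dx**4
    second = (p1 - 2 * c + m1) / dx**2
    advect = (p1**2 - m1**2) / (4 * dx)
    return -nu * fourth - second - advect


def svd_reduce(snapshots, r_dim):
    """SVD of the mean-subtracted snapshot matrix, as in (4.11).

    snapshots has one state per row. Returns the temporal mean, the matrix
    V_r of the r_dim leading right singular vectors, and the cumulative
    energy fraction per number of modes.
    """
    mean = snapshots.mean(axis=0)
    _, s, vt = np.linalg.svd(snapshots - mean, full_matrices=False)
    v_r = vt[:r_dim].T
    energy = np.cumsum(s**2) / np.sum(s**2)
    return mean, v_r, energy


def project(u, mean, v_r):
    """Reduced-order coefficients c = V_r^T u of (4.12), after mean removal."""
    return v_r.T @ (u - mean)
```

The cumulative `energy` array is what one would inspect to choose the number of retained modes against an energy threshold.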
In our study, we pick the r_dim = 2 (out of 512) most energetic modes, as they explain approximately 90% of the total energy. The reduced model state is then given by

$$
c = [c_1, \dots, c_{r_{dim}}]^T.
\tag{4.13}
$$

(ii) Results

We train stateless LSTM models with h = 1 and d = 5. For testing, starting from 1 initial conditions uniformly sampled from the attractor, we generate a Gaussian ensemble of dimension N = 5 centered around the initial condition in the original space with standard deviation of

σ = .1. This ensemble is propagated using the LSTM prediction models, and GPR, MSM and GPR-MSM models trained as in [25]. The root mean square error between the predicted ensemble mean and the ground truth is plotted in Figures 11a and 11b for different values of the parameter ν. All methods reach the invariant measure much faster for 1/ν = 16 than in the less chaotic regime 1/ν = 1 (note the different integration times: T = 4 for 1/ν = 1, while T = 1.5 for 1/ν = 16).

In both chaotic regimes 1/ν = 1 and 1/ν = 16, the reduced order LSTM outperforms all other methods in the short term before escaping the attractor. However, in the long term, LSTM does not stabilize and will eventually diverge faster than GPR (see Figure 11b). Blending LSTM with MSM alleviates the problem, and both accurate short term predictions and long term stability are attained. Moreover, the hybrid LSTM-MSM has better forecasting capabilities compared to GPR. The need for blending LSTM with MSM in the K-S equation is less imperative, as the system is less chaotic than the Lorenz 96 and LSTM methods diverge much slower, while they sufficiently capture the complex nonlinear dynamics. As the intrinsic dimensionality of the attractor rises, LSTM diverges faster.

The mean Anomaly Correlation (3.1) is plotted with respect to time in Figures 11c and 11d for 1/ν = 1 and 1/ν = 16 respectively. The evolution of the AC supports the aforementioned analysis. The mean AC of the trajectory predicted with LSTM remains above the predictability threshold of 0.6 for a longer time than with the other methods. This predictability horizon is approximately 2.5 for ν = 1/1 and .6 for ν = 1/16, since the chaoticity of the system rises and accurate predictions become more challenging.

Figure 11: Mean RMSE of the most energetic mode and mean AC over 1 initial conditions for the K-S equation with 1/ν = 1 (11a, 11c) and 1/ν = 16 (11b, 11d). Legend: standard deviation of the attractor; AC predictability threshold; MSM; GPR; GPR-MSM; LSTM; LSTM-MSM.
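The ensemble evaluation described above (perturb an initial condition with Gaussian noise, propagate every member, compare the ensemble mean with the reference trajectory) can be sketched as follows. This is an illustration under stated assumptions rather than the authors' procedure: `ensemble_mean_rmse` and the one-step propagator `step` are hypothetical names, the default ensemble size and standard deviation are placeholders, and the RMSE is taken over all state components of a single initial condition, whereas the figures report the most energetic mode averaged over many initial conditions.

```python
import numpy as np


def ensemble_mean_rmse(x0, truth, step, n_members=50, sigma=0.1, seed=0):
    """RMSE over time between the ensemble-mean forecast and a reference.

    x0    : initial state, shape (dim,)
    truth : reference states at the forecast times, shape (n_steps, dim)
    step  : hypothetical one-step propagator mapping a state to the next state
    n_members, sigma : placeholder ensemble size and perturbation std
    """
    rng = np.random.default_rng(seed)
    # Gaussian ensemble centered on the initial condition.
    members = x0 + sigma * rng.standard_normal((n_members, x0.size))
    errors = []
    for target in truth:
        members = np.array([step(m) for m in members])
        mean = members.mean(axis=0)
        errors.append(np.sqrt(np.mean((mean - target) ** 2)))
    return np.array(errors)
```

Any forecasting model (LSTM, GPR, MSM or a blend) could be wrapped as `step`, which keeps the comparison between methods on equal footing.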

For the hybrid LSTM-MSM, the percentage of the ensemble members that are explained by LSTM dynamics is plotted in Figure 12. The quotient drops more slowly for 1/ν = 1 in the long run, as the intrinsic dimensionality of the attractor is smaller and trajectories diverge more slowly. However, in the beginning the LSTM percentage is higher for 1/ν = 16, as the MSM drives initial conditions close to the boundary faster towards the attractor due to the higher damping coefficients compared to the case 1/ν = 1. This explains the initial kink in the graph for 1/ν = 16. The slow damping coefficients for 1/ν = 1 do not allow the MSM to drive the trajectories back to the attractor at a faster pace than the diffusion caused by the LSTM forecasting.

Figure 12: Mean over 1 initial conditions of the percentage of ensemble members explained by the LSTM dynamics for the Kuramoto-Sivashinsky equation (T = 1.5). Legend: 1/ν = 1; 1/ν = 16.

(c) A Barotropic Climate Model

In this section, we examine a standard barotropic climate model [55] originating from a realistic winter circulation. The model equations are given by

$$
\frac{\partial \zeta}{\partial t} = -J(\psi, \zeta + f + h) + k_1 \zeta + k_2 \Delta^3 \zeta + \zeta^*,
\tag{4.14}
$$

where ψ is the streamfunction, ζ = Δψ the relative vorticity, f the Coriolis parameter, ζ* a constant vorticity forcing, while k_1 and k_2 are the Ekman damping and the scale-selective damping coefficient. J is the Jacobi operator given by

$$
J(a, b) = \left( \frac{\partial a}{\partial \lambda} \frac{\partial b}{\partial \mu} - \frac{\partial a}{\partial \mu} \frac{\partial b}{\partial \lambda} \right),
\tag{4.15}
$$

where μ and λ denote the sine of the geographical latitude and the longitude respectively. The equation of the barotropic model (4.14) is non-dimensionalized using the radius of the earth as unit length and the inverse of the earth angular velocity as time unit. The non-dimensional orography h is related to the real Northern Hemisphere orography h' by h = 2 sin(φ_0) A_0 h'/H, where φ_0 is a fixed latitude of 45° N, A_0 is a factor expressing the surface wind strength blowing across the orography, and H a scale height [55]. The streamfunction ψ is expanded into a spherical harmonics series and truncated at wavenumber 21, while modes with an even total wavenumber are excluded, avoiding currents across the equator and ending up with a hemispheric model with 231 degrees of freedom.

The training data are obtained by integrating Eq. (4.14) for 10^5 days after an initial spin-up period of 1 days, using a fourth-order Adams-Bashforth integration scheme with a 45-min time step in accordance with [25], with k_1 = 15 days, while k_2 is selected such that wavenumber 21 is damped at a time scale of 3 days. In this way we end up with a time series ζ_t with 10^4 samples. The spherical surface is discretized into a D = … mesh with equally spaced latitude and longitude. From the gathered data, 90% is used for training and 10% for validation. The mean and variance of the statistical steady state are shown in Figure 13a.
(i) Dimensionality Reduction: Classical Multidimensional Scaling

The original problem dimension of 231 is reduced using a generalized version of the classical multidimensional scaling method [56]. The procedure tries to identify an embedding with a lower dimensionality such that the pairwise inner products of the dataset are preserved. Assuming that the dataset consists of points $\zeta_i$, $i \in \{1, \dots, N\}$, whose reduced order representation is denoted with $y_i$, the procedure is equivalent to the solution of the following optimization problem

$$
\underset{y_1, \dots, y_N}{\text{minimize}} \; \sum_{i<j} \left( \langle \zeta_i, \zeta_j \rangle_\zeta - \langle y_i, y_j \rangle_y \right)^2,
\tag{4.16}
$$

where $\langle \cdot, \cdot \rangle_\zeta$ and $\langle \cdot, \cdot \rangle_y$ denote some well defined inner product of the original space ζ and the embedding space y respectively. Problem (4.16) minimizes the total squared error between pairwise products. In case both products are the scalar products, the solution of (4.16) is equivalent to PCA. Assuming only $\langle \cdot, \cdot \rangle_y$ is the scalar product, problem (4.16) also accepts an analytic solution. Let $W_{ij} = \langle \zeta_i, \zeta_j \rangle_\zeta$ be the coefficients of the Gram matrix, $|k_1| \geq |k_2| \geq \dots \geq |k_N|$ its eigenvalues sorted in descending absolute value and $u_1, u_2, \dots, u_N$ the respective eigenvectors. The optimal d-dimensional embedding for a point $\zeta_n$ is given by

$$
y_n = \begin{bmatrix} k_1^{1/2} u_1^n \\ k_2^{1/2} u_2^n \\ \vdots \\ k_d^{1/2} u_d^n \end{bmatrix},
\tag{4.17}
$$

where $u_m^n$ denotes the n-th component of the m-th eigenvector. The optimality of (4.17) can be proven by the Eckart-Young-Mirsky theorem, as problem (4.16) is equivalent to finding the best rank-d approximation in the Frobenius norm. In our problem, the standard kinetic energy product is used to preserve the nonlinear symmetries of the system dynamics [25]:

$$
\langle \zeta_i, \zeta_j \rangle_\zeta = \int_S \nabla\psi_i \cdot \nabla\psi_j \, dS = -\int_S \zeta_i \psi_j \, dS = -\int_S \zeta_j \psi_i \, dS,
\tag{4.18}
$$

where the last identities are derived using partial integration and the fact that ζ = Δψ. The energy spectrum of the modes of the reduced order space y is plotted in Figure 13a.

Figure 13: Mean, variance and energy distribution of the Barotropic model at statistical steady state. Panel (a): mean and variance maps; panels (b) and (c): energy per mode and cumulative energy in % versus number of modes used.

Solution (4.17) is only optimal with respect to the N training data points used to construct the Gram matrix. In order to calculate the embedding for a new point, it is convenient to compute the empirical orthogonal functions (EOFs), which form an orthonormal basis of the reduced order space y [25]. The EOFs are given by

$$
\phi_m = \sum_{n=1}^{N} k_m^{-1/2} u_m^n \zeta_n,
\tag{4.19}
$$

where m runs from 1 to d. The EOFs are sorted in descending order according to their energy level. The first four EOFs are plotted in Figure 14. EOF analysis has been used to identify individual realistic climatic modes such as the Arctic Oscillation (AO) [57,58], known as teleconnections. The first EOF is characterized by a center of action over the Arctic that

is surrounded by a zonally symmetric structure in mid-latitudes. This pattern resembles the Arctic Oscillation/Northern Hemisphere Annular Mode (AO/NAM) [57] and explains approximately 13.5% of the total energy. The second, third and fourth EOFs are quantitatively very similar to the East Atlantic/West Russia [59], the Pacific/North America (PNA) [60] and the Tropical/Northern Hemisphere (TNH) [61] patterns and account for 11.4%, 10.4% and 7.1% of the total energy respectively. Since these EOFs feature realistic climate teleconnections, performing accurate predictions of them is of high practical importance.

Figure 14: The four most energetic empirical orthogonal functions of the barotropic model (panels: Mode 1 to Mode 4).

As a consequence of the orthogonality of the EOFs with respect to the kinetic energy product, the reduced representation y* of a new state ζ* can be recovered from

$$
y^* = \begin{bmatrix} \langle \zeta^*, \phi_1 \rangle_\zeta \\ \langle \zeta^*, \phi_2 \rangle_\zeta \\ \vdots \\ \langle \zeta^*, \phi_d \rangle_\zeta \end{bmatrix}.
\tag{4.20}
$$

In essence, the EOFs act as an orthogonal basis of the reduced order space and the new state ζ* is projected onto this basis. Only the d coefficients corresponding to the most energetic EOFs form the reduced order state y*. In our study, the dimensionality of the reduced space is r_dim = 30, as φ_30 contains only 3.65% of the energy of φ_1, while the 30 most energetic modes contain approximately 82% of the total energy, as depicted in Figure 13c.
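The embedding (4.17), the EOFs (4.19) and the projection of a new state (4.20) can be sketched numerically as below. This is a minimal sketch under a loud assumption: the plain Euclidean scalar product stands in for the kinetic energy product (4.18), so it illustrates the algebra only and not the barotropic setup. The names `mds_embedding` and `project_new` are hypothetical, and the d leading eigenvalues of the Gram matrix are assumed to be strictly positive.

```python
import numpy as np


def mds_embedding(Z, d):
    """Classical MDS embedding and EOFs for data Z of shape (N, D).

    Assumes the Euclidean inner product in the original space (a stand-in
    for the kinetic energy product) and positive leading eigenvalues.
    Returns the embedding Y (N, d) and the EOFs Phi (d, D).
    """
    W = Z @ Z.T                               # Gram matrix W_ij = <z_i, z_j>
    k, u = np.linalg.eigh(W)                  # W is symmetric
    order = np.argsort(np.abs(k))[::-1][:d]   # descending absolute value
    k, u = k[order], u[:, order]
    Y = u * np.sqrt(k)                        # (4.17): y_n[m] = k_m^{1/2} u_m^n
    Phi = (u / np.sqrt(k)).T @ Z              # (4.19): phi_m = sum_n k_m^{-1/2} u_m^n z_n
    return Y, Phi


def project_new(z, Phi):
    """Reduced representation of a new state, as in (4.20)."""
    return Phi @ z                            # <z, phi_m> for m = 1, ..., d
```

With this construction the EOFs are orthonormal, and projecting a training point onto them reproduces its row of the embedding, which is the consistency between (4.17), (4.19) and (4.20) that the text relies on.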

(ii) Training and Prediction

The reduced order state that we want to predict using the LSTM consists of the 30 components of y. A stateless LSTM with h = 14 hidden units is considered, while the truncated back-propagation horizon is set to d = 1. The prototypical system is less chaotic than the K-S equation and the Lorenz 96, which enables us to use more hidden units. The reason is that, as chaoticity decreases, trajectories sampled from the attractor as training and validation datasets become more interconnected, and the task is inherently easier and less prone to overfitting. In the extreme case of a periodic system, the information would be identical. 5 points are randomly picked from the attractor as initial conditions for testing. A Gaussian ensemble with a small variance (σ_en = .1) along each dimension is formed and marched using the reduced-order GPR, MSM, mixed GPR-MSM and LSTM methods.

(iii) Results

The RMSE of the four most energetic reduced order space variables y_i for i ∈ {1, ..., 4} is plotted in Figure 15. The LSTM takes 4-5 h to reach the attractor, while GPR based methods generally take 3-4 h. In contrast, the MSM reaches the attractor already after 1 hour. This implies that the LSTM can better capture the non-linear dynamics compared to GPR. Note that the barotropic model is much less chaotic than the Lorenz 96 system with F = 16, where all methods show comparable prediction performance. Blended LSTM models with MSM are omitted here, as LSTM models only reach the attractor standard deviation towards the end of the simulated time and MSM-LSTM shows identical performance.

Figure 15: Mean RMSE of the most energetic EOFs over 5 initial conditions for the Barotropic climate model. Panels (a)-(d): RMSE of y_1, y_2, y_3 and y_4 versus time (hours). Legend: standard deviation of the attractor; MSM; GPR; GPR-MSM; LSTM.

5. A Comment on Computational Cost of Prediction

The computational cost of making a single prediction can be quantified by the number of operations (multiplications and additions) needed. In GPR based approaches the computational cost in Landau notation is O(N^2), where N is the number of samples used in training. For the GPR methods illustrated in the previous section, N ≈ 25. The GPR models the global dynamics by uniformly sampling the attractor and "carries" this training dataset at each time instant to identify the geometric relation between the input and the training dataset and make (exact) probabilistic inference on the output.

In contrast, LSTM learns the behavior by adjusting its parameters, which leads to a prediction cost that does not depend on the number of samples used for training. The inference complexity is roughly O(d_i · d · h + d · h^2), where d_i is the dimension of each input, d is the number of inputs and h is the number of hidden units. This complexity is significantly smaller than that of GPR, which translates to faster prediction. Especially in real-time applications that require fast short-term predictions of a complex system, the LSTM has an advantage. However, it is to be expected that the LSTM is more prone to diverge from the attractor, as there is no guarantee that the infrequent training samples near the attractor limits were memorized. This remark explains the faster divergence of LSTM in the more turbulent regimes considered in Section 4.

6. Conclusions

We propose a data-driven method, based on long-short term memory networks, for modeling and prediction in the reduced space of chaotic dynamical systems. The LSTM uses the short term history of the reduced order variable to predict the state derivative and uses it for one-step prediction. The network is trained on time-series data and it requires no prior knowledge of the underlying governing equations. Using the trained network, long-term predictions are made by iteratively predicting one step forward.
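The iterative forecasting loop summarized above can be sketched as follows. This is a hedged illustration, not the authors' implementation: `iterative_forecast` and `derivative_model` are hypothetical names, the trained network is represented by an arbitrary callable that maps the last d reduced states to a predicted state derivative, and the forward Euler update used for the one-step prediction is an assumption of this sketch.

```python
import numpy as np


def iterative_forecast(history, derivative_model, dt, n_steps):
    """Roll a derivative-predicting model forward one step at a time.

    history          : the last d reduced states, shape (d, r_dim)
    derivative_model : hypothetical callable, maps a (d, r_dim) window to the
                       predicted state derivative of shape (r_dim,)
    dt               : time step of the data
    Returns the n_steps predicted states, shape (n_steps, r_dim).
    """
    window = np.array(history, dtype=float)
    predictions = []
    for _ in range(n_steps):
        dz = derivative_model(window)        # predicted state derivative
        z_next = window[-1] + dt * dz        # one-step update (forward Euler, assumed)
        predictions.append(z_next)
        # Slide the window: drop the oldest state, append the prediction.
        window = np.vstack([window[1:], z_next])
    return np.array(predictions)
```

Because each prediction is fed back as input, any one-step error is carried into all later steps, which is consistent with the error growth in chaotic regimes discussed in this section.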
The features of the proposed technique are showcased through comparisons with GPR and MSM on benchmark cases. Three applications are considered: the Lorenz 96 system, the Kuramoto-Sivashinsky equation and a barotropic climate model. The chaoticity of these systems ranges from weakly chaotic to fully turbulent, ensuring a complete simulation study. Comparison measures include the RMSE and AC between the predicted trajectories and trajectories of the real dynamics. In all cases, the proposed approach performs better in short term predictions, as the LSTM is more efficient in capturing the local dynamics and complex interactions between the modes. However, the prediction error propagates fast and the prediction, similar to GPR, does not converge to the invariant measure. Furthermore, in the cases of increased chaoticity the LSTM diverges faster than GPR. This may be attributed to the absence of certain attractor regions in the training data, insufficient training, and propagation of the exponentially increasing prediction error. To mitigate this effect, LSTM is also combined with MSM, following ideas presented in [25], in order to guarantee convergence to the invariant measure. Blending LSTM or GPR with MSM leads to a deterioration in the short term prediction performance, but the steady-state statistical behavior is captured. The hybrid LSTM-MSM exhibits slightly better performance than GPR-MSM in all systems considered in this study. In the Kuramoto-Sivashinsky equation, LSTM can better capture the local dynamics compared to Lorenz 96 due to the lower intrinsic dimensionality of the attractor. The LSTM shows comparable forecasting accuracy with GPR in the barotropic model. The intrinsic dimensionality is significantly smaller than in Kuramoto-Sivashinsky and Lorenz 96 and both methods can effectively capture the dynamics. Moreover, the prediction error does not propagate as rapidly as in Lorenz 96 and the blended LSTM-MSM scheme is omitted.
Future directions include modeling the lower energy modes and interpolation errors using a stochastic component in the LSTM to improve the forecasting accuracy. Another possible
