arxiv: v2 [cs.ai] 16 Feb 2016


Cells in Multidimensional Recurrent Neural Networks

arXiv:1412.2620v2 [cs.AI] 16 Feb 2016

Gundram Leifert GUNDRAM.LEIFERT@UNI-ROSTOCK.DE
Tobias Strauß TOBIAS.STRAUSS@UNI-ROSTOCK.DE
Tobias Grüning TOBIAS.GRUENING@UNI-ROSTOCK.DE
Welf Wustlich WELF.WUSTLICH@UNI-ROSTOCK.DE
Roger Labahn ROGER.LABAHN@UNI-ROSTOCK.DE
University of Rostock, Institute of Mathematics, 18051 Rostock, Germany

Editor: Yoshua Bengio

Abstract

The transcription of handwritten text on images is one task in machine learning, and one solution is to use multi-dimensional recurrent neural networks (MDRNN) with connectionist temporal classification (CTC). The RNNs can contain special units, the long short-term memory (LSTM) cells. They are able to learn long term dependencies, but they become unstable when the dimension is chosen greater than one. We define some useful and necessary properties for the one-dimensional LSTM cell and extend them to the multi-dimensional case. Thereby we introduce several new cells with better stability. We present a method to design cells using the theory of linear shift invariant systems. The new cells are compared to the LSTM cell on the IFN/ENIT and Rimes databases, where we can improve the recognition rate compared to the LSTM cell. So each application where LSTM cells are used in MDRNNs could be improved by substituting them with the newly developed cells.

Keywords: LSTM, MDRNN, CTC, handwriting recognition, neural network

GUNDRAM LEIFERT ET AL. <7.2.206>

1. Introduction

Since the last decade, artificial neural networks (NN) have become state-of-the-art in many fields of machine learning; for example, they can be applied to pattern recognition. Typical NNs are feed-forward NNs (FFNN) or recurrent NNs (RNN), where the latter contain recurrent connections. When nearby inputs depend on each other, providing these inputs as additional information to the NN can improve its recognition result. FFNNs obtain these dependencies by making the nearby inputs accessible. If RNNs are used, the recurrent connections can be used to learn whether the surrounding input is relevant, but these connections result in a dependency that vanishes over time. In S. Hochreiter (1997) the authors develop the long short-term memory (LSTM), which is able to have a long term dependency. This LSTM is extended in A. Graves and Schmidhuber (2007) to the multi-dimensional (MD) case and is used in a hierarchical multi-dimensional RNN (MDRNN), which performed best in three competitions at the International Conference on Document Analysis and Recognition (ICDAR) in 2009 without any feature extraction or knowledge of the recognized language model.

In this paper we analyse these MD LSTMs regarding their ability to provide long term dependencies in MDRNNs and show that they can easily have an unwanted growing dependency for higher dimensions. We define a more general description of an LSTM, a cell, and change the LSTM architecture, which leads to new MD cell types that can also provide long term dependencies. In two experiments we show that substituting the LSTM in MDRNNs by these cells works well. Due to this we assume that substituting the LSTM cell by the best performing cell, the LeakyLP cell, will improve the performance of an MDRNN in other scenarios as well. Furthermore, the new cell types could also be used in the one-dimensional (1D) case, so using them in a bidirectional RNN with LSTMs (BLSTM) could lead to better recognition rates.

In Section 2 we introduce the reader to the development of the LSTM cells (S. Hochreiter, 1997) and its extension (F. A. Gers and Cummins, 1999). Based on that, in Section 3 we define two properties that probably lead to the good performance of the 1D LSTM cells. Both together guarantee that the cell can have a long term dependency. A third property ensures that the gradient cannot explode over time. In Section 4 we show that the MD version of the LSTM is still able to provide long term dependency, whereas the gradient can explode easily for dimension greater than 1. In Section 5 we change the architecture of the MD LSTM cell and reduce it to the 1D LSTM cell so that the cell fulfills the two properties for any dimension. Nevertheless, the internal cell state can grow linearly over time. This problem is solved in Section 6 using a trainable convex combination of the input and the previous internal cell states. The new cell type can provide long term dependencies and does not suffer from exploding gradients. Motivated by the last sections, we introduce a more general way to define MD cells in Section 7. Using the theory of linear shift-invariant systems and their frequency analysis we are able to get a new interpretation of the cells, and we create 5 new cell types. To test the performance of the cells, in Section 8 we take two data sets from the ICDAR 2009 competitions, where the MDRNNs with LSTM cells won. On these data sets we compare the recognition results of the MDRNNs when we substitute the LSTM cells by the newly developed cells. On both data sets, the IFN/ENIT data set and the RIMES data set, we can improve the recognition rate using the newly developed cells.

CELLS IN MDRNNS <7.2.206>

Figure 1: Schematic diagram of a unit: The unit $\gamma \in H$ is a simple neuron with the network's feed-forward input $y_I(t) = (y_i(t))_{i \in I}$ and recurrent input $y_H(t-1) = (y_h(t-1))_{h \in H}$ and an output activation $y_\gamma(t)$. Right: A unit has an input activation $net_\gamma(t)$, which is a linear combination of the source activations $y_I(t)$, $y_H(t-1)$. The output activation $y_\gamma(t)$ is computed by applying the activation function $f_\gamma$ to the input activation. Left: The short notation of a unit.

2. Previous Work

In this section we briefly introduce a recurrent neural network (RNN) and the development of the LSTM cell. In previous literature there are various notations to describe the update equations of RNNs and LSTMs. To unify the notations, we give the corresponding notation of the cited works (F. A. Gers and Cummins, 1999; S. Hochreiter, 1997; Graves and Schmidhuber, 2008) next to our equations. We concentrate on a simple hierarchical RNN with one input layer with the set of neurons $I$, one recurrent hidden layer with the set of neurons $H$ and one output layer with the set of neurons $O$. For each time step $t \in \mathbb{N}$ the layers are updated asynchronously in the order $I$, $H$, $O$. Within one specific layer all neurons can be updated synchronously. In the hidden layer, for one neuron $c \in H$ at time $t \in \mathbb{N}$ we calculate the neuron's input activation $net_c$ (cited notation: $a_c(t)$) by

$$net_c(t) = \sum_{i \in I} w_{c,i}\, y_i(t) + \sum_{h \in H} w_{c,h}\, y_h(t-1) \tag{1}$$

with weights $w_{[\text{target neuron}],[\text{source neuron}]}$. A bias can be added by extending the set $I := I \cup \{\text{bias}\}$ with $y_{\text{bias}}(t) = 1\ \forall t \in \mathbb{N}$; hence we will not write the bias in the equations, but we use it in our RNNs in Section 8. The neuron's output activation (cited notation: $y^c(t)$, $b_c(t)$) is calculated by

$$y_c(t) = f_c(net_c(t))$$

with a differentiable sigmoid activation function $f_c$. To make (1) well defined when $t - 1 \le 0$, we define $\forall h \in H,\ \forall t \in \mathbb{Z} \setminus \mathbb{N}: y_h(t) = 0$. We call this simple neuron, with a linear function of activations as input and one activation function, a unit (compare Figure 1). In (1) the activation of the unit depends on the current activations of the layer below and the previous activations of the units from the same layer. When there are no recurrent connections, $\forall c, h \in H: w_{c,h} = 0$, the layer is called a feed-forward layer, otherwise a recurrent layer.

2.1 The Long Short-Term Memory

A standard LSTM cell $c$ has one input with an input activation $y_{c_{in}}(t)$, a set of gates, one internal state $s_c$ and one output activation $y_c$ (cited notation: $y^c$). The gates are also units and their task is to learn whether a signal should pass the gate or not. They almost always have the logistic activation function

$$f_{\log}(x) := \frac{1}{1 + \exp(-x)} \quad \text{(cited notation: } f_1(x)\text{)}.$$

The input of the standard LSTM cell is calculated from a unit with an odd activation function with a slope of 1 at $x = 0$. We use $f_c(x) = \tanh(x)$ in this paper; another solution could be $f_c(x) = 2\tanh(x/2)$ (see S. Hochreiter, 1997). The standard LSTM has two gates: the input gate (IG or $\iota$) and the output gate (OG or $\omega$). Both gates are calculated like a unit, so that

$$net_\iota(t) = \sum_{i \in I} w_{\iota,i}\, y_i(t) + \sum_{h \in H} w_{\iota,h}\, y_h(t-1), \qquad y_\iota(t) = f_{\log}(net_\iota(t))$$

(cited notation: $y^{in}(t)$, $b_\iota(t)$) and

$$net_\omega(t) = \sum_{i \in I} w_{\omega,i}\, y_i(t) + \sum_{h \in H} w_{\omega,h}\, y_h(t-1), \qquad y_\omega(t) = f_{\log}(net_\omega(t))$$

(cited notation: $y^{out}(t)$, $b_\omega(t)$). The input of an LSTM is defined like in (1) by

$$net_{c_{in}}(t) = \sum_{i \in I} w_{c_{in},i}\, y_i(t) + \sum_{h \in H} w_{c_{in},h}\, y_h(t-1), \qquad y_{c_{in}}(t) = f_c(net_{c_{in}}(t))$$

(cited notation: $net_c(t)$ and $g(net_c(t))$, $f_2(net_c(t))$). The internal state $s_c(t)$ is calculated by

$$s_c(t) = y_{c_{in}}(t)\, y_\iota(t) + s_c(t-1), \tag{2}$$

and the output activation $y_c(t)$ of the LSTM (cited notation: $y^c(t)$, $b_c(t)$) is calculated from

$$y_c(t) = h_c(s_c(t))\, y_\omega(t) \tag{3}$$

with $h_c(x) := \tanh(x)$ (cited notation: $f_3(x)$).

The LSTM can be interpreted as a kind of memory module where the internal state stores the information. For a given input $y_{c_{in}}(t)$, the IG decides whether the new input is relevant for the internal state. If so, the input is added to the internal state. The information of the input is now saved in the activation of the internal state. The OG determines whether or not the internal activation should be displayed to the rest of the network. So the information stored in the LSTM is only readable when the OG is active. To sum up, an open IG can be seen as a "write" operation into the memory and an open OG as a "read" operation on the memory.

Another way to understand the LSTM is to take a look at the gradient propagated through it. To analyse the LSTM properly, we have to ignore gradients coming from recurrent weights. We define the truncated gradient similar to S. Hochreiter (1997) and F. A. Gers and Cummins (1999).

Definition 1 (truncated gradient) Let $\gamma \in \{c_{in}, \iota, \omega\}$ be any input or gate unit and $y_c(t-1)$ any previous output activation. The truncated gradient differs from the exact gradient only by setting the recurrent weighted gradient propagation $\frac{\partial net_\gamma(t)}{\partial y_c(t-1)}$ to zero. We write

$$\frac{\partial net_\gamma(t)}{\partial y_c(t-1)} = w_{\gamma,c} =_{tr} 0.$$

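Now, let $E$ be an arbitrary error which is used to train the RNN and $\frac{\partial E(t)}{\partial y_c(t)}$ the resulting derivative at the output of the LSTM. The OG can eliminate the gradient coming from the output, because

$$\frac{\partial y_c(t)}{\partial s_c(t)} = \underbrace{h_c'(s_c(t))}_{\in (0,1]}\ \underbrace{y_\omega(t)}_{\in (0,1)},$$

so the OG decides when the gradient should go into the internal state. Especially for $|s_c(t)| \ll 1$ we get

$$\frac{\partial y_c(t)}{\partial s_c(t)} \approx y_\omega(t).$$

The key idea of the LSTM is that an error that occurs at the internal state neither explodes nor vanishes over time. Therefore, we take a look at the partial derivative $\frac{\partial s_c(t)}{\partial s_c(t-1)}$, which is also known as the error carousel (for more details see S. Hochreiter, 1997). Using the truncated gradient of Definition 1 for this derivative, we get

$$\begin{aligned}
\frac{\partial s_c(t)}{\partial s_c(t-1)}
&= y_{c_{in}}(t)\,\frac{\partial y_\iota(t)}{\partial s_c(t-1)} + y_\iota(t)\,\frac{\partial y_{c_{in}}(t)}{\partial s_c(t-1)} + \frac{\partial s_c(t-1)}{\partial s_c(t-1)} \\
&= y_{c_{in}}(t)\,\frac{\partial y_\iota(t)}{\partial y_c(t-1)}\frac{\partial y_c(t-1)}{\partial s_c(t-1)} + y_\iota(t)\,\frac{\partial y_{c_{in}}(t)}{\partial y_c(t-1)}\frac{\partial y_c(t-1)}{\partial s_c(t-1)} + 1 \\
&= y_{c_{in}}(t)\,\frac{\partial y_\iota(t)}{\partial net_\iota(t)}\underbrace{\frac{\partial net_\iota(t)}{\partial y_c(t-1)}}_{=_{tr}\,0}\frac{\partial y_c(t-1)}{\partial s_c(t-1)} + y_\iota(t)\,\frac{\partial y_{c_{in}}(t)}{\partial net_{c_{in}}(t)}\underbrace{\frac{\partial net_{c_{in}}(t)}{\partial y_c(t-1)}}_{=_{tr}\,0}\frac{\partial y_c(t-1)}{\partial s_c(t-1)} + 1 \\
&=_{tr} 1.
\end{aligned} \tag{4}$$

So, once we have a gradient at the internal state, we can use the chain rule and get $\forall \tau \in \mathbb{N}: \frac{\partial s_c(t)}{\partial s_c(t-\tau)} =_{tr} 1$. This is called the constant error carousel. Just as the OG can eliminate the gradient coming from the LSTM output, the IG can do the same with the gradient coming from the internal state; that is, it decides when the gradient should be injected into the source activations. This can be seen by taking a look at the partial derivative

$$\frac{\partial s_c(t)}{\partial net_{c_{in}}(t)} = \frac{\partial s_c(t)}{\partial y_{c_{in}}(t)}\,\frac{\partial y_{c_{in}}(t)}{\partial net_{c_{in}}(t)} = y_\iota(t)\, f_c'(net_{c_{in}}(t)).$$

If there is a small input $|net_{c_{in}}(t)| \ll 1$, we get $f_c'(net_{c_{in}}(t)) \approx 1$ and can estimate

$$\frac{\partial s_c(t)}{\partial net_{c_{in}}(t)} \approx y_\iota(t).$$

All in all, this LSTM is able to store information and learn long term dependencies, but it has one drawback, which will be discussed in 2.2.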
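The update equations (2) and (3) and the constant error carousel of (4) can be illustrated with a minimal sketch. It is not taken from the paper: the function names `lstm_step` and `truncated_state_gradient` are hypothetical, the net inputs are passed in directly instead of being computed from weights, and the truncated gradient is emulated by holding the gate and input activations fixed while the previous state is perturbed.

```python
import math


def f_log(x):
    """Logistic activation used for the gates."""
    return 1.0 / (1.0 + math.exp(-x))


def lstm_step(net_in, net_iota, net_omega, s_prev):
    """One step of the standard LSTM (no forget gate), following (2) and (3)."""
    y_in = math.tanh(net_in)        # cell input, f_c = tanh
    y_iota = f_log(net_iota)        # input gate activation
    y_omega = f_log(net_omega)      # output gate activation
    s = y_in * y_iota + s_prev      # internal state, eq. (2)
    y = math.tanh(s) * y_omega      # cell output, eq. (3)
    return s, y


def truncated_state_gradient(net_in, net_iota, net_omega, s_prev, eps=1e-6):
    """Central finite difference of s(t) with respect to s(t-1).

    The net inputs are kept fixed, i.e. they do not depend on the previous
    output; this mimics the truncated gradient (an assumption of this sketch).
    """
    s_plus, _ = lstm_step(net_in, net_iota, net_omega, s_prev + eps)
    s_minus, _ = lstm_step(net_in, net_iota, net_omega, s_prev - eps)
    return (s_plus - s_minus) / (2.0 * eps)


if __name__ == "__main__":
    print(truncated_state_gradient(0.3, 0.2, -0.1, 1.5))
```

Under these assumptions the estimated derivative stays at 1 regardless of the gate activations, which is what the error carousel states.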

2.2 Learning to Forget

For long time series the internal state is unbounded (compare with F. A. Gers and Cummins, 1999, 2.1). Assuming a positive or negative input and a non-zero activation of the IG, the absolute activation of the internal state grows over time. Using the weight-space symmetries in a network with at least one hidden layer (Bishop, 2006, 5.1.1) we assume without loss of generality $y_{c_{in}}(t) \ge 0$, so $s_c(t) \to \infty$ for $t \to \infty$. Hence, the activation function $h_c$ saturates and (3) can be simplified to

$$y_c(t) = \underbrace{h_c(s_c(t))}_{\approx 1}\, y_\omega(t) \approx y_\omega(t).$$

Thus, for large activations of $s_c(t)$ the whole LSTM works like a unit with a logistic activation function. A similar problem can be observed for the gradient. The gradient coming from the output is multiplied by the activation of the OG and the derivative of $h_c$. For large values of $s_c(t)$ we get $h_c'(s_c(t)) \to 0$ and we can estimate the partial derivative

$$\frac{\partial y_c(t)}{\partial s_c(t)} = h_c'(s_c(t))\, y_\omega(t) \approx 0,$$

which can be interpreted as the OG not being able to propagate the gradient back into the LSTM. Some solutions to the linearly growing state problem are introduced in F. A. Gers and Cummins (1999). They tried to stabilize the LSTM with a state decay by multiplying the internal state in each time step by a value in $(0, 1)$, which did not improve the performance. Another solution was to add an additional gate, the forget gate (FG or $\phi$). The last state $s_c(t-1)$ is multiplied by the activation of the FG before it is added to the current state $s_c(t)$. So we can substitute (2) by

$$s_c(t) = y_{c_{in}}(t)\, y_\iota(t) + s_c(t-1)\, y_\phi(t),$$

so that the truncated gradient in (4) is changed to

$$\frac{\partial s_c(t)}{\partial s_c(t-1)} = y_{c_{in}}(t)\,\frac{\partial y_\iota(t)}{\partial s_c(t-1)} + y_\iota(t)\,\frac{\partial y_{c_{in}}(t)}{\partial s_c(t-1)} + y_\phi(t) =_{tr} y_\phi(t),$$

and for longer time series we get

$$\forall \tau \in \mathbb{N}: \quad \frac{\partial s_c(t)}{\partial s_c(t-\tau)} =_{tr} \prod_{t'=0}^{\tau-1} y_\phi(t - t').$$

Now the Extended LSTM is able to learn to forget its previous state. However, an Extended LSTM is still able to work like a standard LSTM without FG by having an activation $y_\phi(t) \approx 1$. In this paper we denote the Extended LSTM as LSTM.

Another point of view was introduced in Bengio et al. (1994): To learn long term dependencies, a system must have an architecture in which an input can be saved over a long time and which does not suffer from the vanishing gradient problem. On the other hand, the system should avoid an exploding gradient, which means that a small disturbance has a growing influence over time. In this paper we do not want to solve the problem of vanishing and exploding gradients for a whole system; we want to solve this problem only for one single cell. But we think that this is a necessary condition for a system to provide long term dependencies.
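As a small numerical illustration of this section (a sketch under simplifying assumptions, not the authors' code), the helpers below use constant gate activations and hypothetical names. They show the linear growth of the state without a forget gate, the resulting saturation of $h_c = \tanh$, and the product form of the truncated gradient once a forget gate is present.

```python
import math


def carousel_gradient(forget_activations):
    """Truncated ds(t)/ds(t - tau) as the product of forget gate activations."""
    grad = 1.0
    for y_phi in forget_activations:
        grad *= y_phi
    return grad


def run_state(inputs, y_iota=1.0, y_phi=1.0):
    """Iterate s(t) = y_in(t) * y_iota + s(t-1) * y_phi with constant gates.

    y_phi = 1.0 corresponds to the standard LSTM without forget gate.
    """
    s = 0.0
    for y_in in inputs:
        s = y_in * y_iota + s * y_phi
    return s


def output_sensitivity(s, y_omega=1.0):
    """dy/ds = h'(s) * y_omega with h = tanh."""
    return (1.0 - math.tanh(s) ** 2) * y_omega


if __name__ == "__main__":
    s_no_forget = run_state([0.5] * 100)             # grows linearly with time
    s_forget = run_state([0.5] * 100, y_phi=0.5)     # stays bounded
    print(s_no_forget, output_sensitivity(s_no_forget))
    print(s_forget, output_sensitivity(s_forget))
    print(carousel_gradient([0.9] * 10))
```

With a constant positive input and an open input gate, the state without a forget gate reaches a value where the output derivative is numerically zero, while a forget activation below 1 keeps the state bounded at the cost of a decaying carousel gradient.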

Figure 2: Schematic diagram of a cell: The function $g_{int}$ calculates the internal state $s_c(t)$ from the previous internal states $s_c(t-1), \dots, s_c(t-k)$ and the cell input $y_{c_{in}}(t)$ using the gate activations $y_\Gamma(t)$. The function $g_{out}$ calculates the output $y_c(t)$ of the cell from the actual and previous internal states $s_c(t), \dots, s_c(t-k)$ and the cell input $y_{c_{in}}(t)$, also using the gate activations $y_\Gamma(t)$.

3. Cells and Their Properties

In this section we introduce a general cell and work out properties of these cells which probably lead to the good performance observed with LSTM cells.

Definition 2 (Cell, cf. Fig. 2) A cell, $c$, of order $k$ consists of one designated input unit, $c_{in}$, with sigmoid activation function $f_c$ (typically $f_c = \tanh$ unless specified otherwise); a set $\Gamma$ (not containing $c_{in}$) of units called gates $\gamma_1, \gamma_2, \dots$ with sigmoid activation functions $f_{\gamma_i}$, $i = 1, \dots$ (typically logistic, $f_{\gamma_i} = f_{\log}$, unless specified otherwise); an arbitrary function, $g_{int}$, and a cell activation function, $g_{out}$, mapping into $[-1, 1]$. Each unit of $\Gamma \cup \{c_{in}\}$ receives the same set of input activations. The cell update in time step $t \in \mathbb{N}$ is performed in three subsequent phases:

1. Following the classical update scheme of neurons (see Section 2), all units in $\Gamma \cup \{c_{in}\}$ synchronously calculate their activations, which will be denoted by $y_\Gamma(t) := (y_\gamma(t))_{\gamma \in \Gamma}$ and $y_{c_{in}}(t)$. Furthermore, we call $y_{c_{in}}(t)$ the input activation of the cell.

2. Then, the cell computes its so-called internal state $s_c(t) := g_{int}(y_\Gamma(t), y_{c_{in}}(t), s_c(t-1), \dots, s_c(t-k))$.

3. Finally, the cell computes its so-called output activation $y_c(t) := g_{out}(y_\Gamma(t), y_{c_{in}}(t), s_c(t), s_c(t-1), \dots, s_c(t-k))$.

In this paper we concentrate on first order cells ($k = 1$).

Figure 3: Schematic diagram of a one-dimensional LSTM cell: The input $c_{in}$ is multiplied by the IG $\iota$. The previous state $s_c(t-1)$ is gated by the FG $\phi$ and added to the activation coming from the IG and input. The output of the cell is the internal state, squashed by $h_c(x) = \tanh(x)$ and gated by the OG $\omega$.

Now we use Definition 2 to re-introduce the Extended LSTM cell.

Remark 3 (LSTM cell) An LSTM cell $c$ is a cell of order 1 where $h_c = \tanh$ and $\Gamma = \{\iota, \phi, \omega\}$,

$$s_c(t) := g_{int}(y_\Gamma(t), y_{c_{in}}(t), s_c(t-1)) := y_{c_{in}}(t)\, y_\iota(t) + s_c(t-1)\, y_\phi(t),$$

$$y_c(t) := g_{out}(y_\Gamma(t), s_c(t)) := h_c(s_c(t))\, y_\omega(t).$$
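The cell abstraction of Definition 2 and its LSTM instance from Remark 3 can be sketched as follows for order $k = 1$. This is an illustrative sketch only: `make_cell`, the dictionary keys and the idealized gate values 0 and 1 are assumptions of the example (logistic gates only approach these values), and phase 1 of the update, computing the gate and input activations from weights, is left out.

```python
import math


def make_cell(g_int, g_out):
    """Build the update of a first order cell from g_int and g_out."""
    def step(gates, y_in, s_prev):
        s = g_int(gates, y_in, s_prev)   # phase 2: internal state
        y = g_out(gates, y_in, s)        # phase 3: output activation
        return s, y
    return step


# LSTM cell of Remark 3: Gamma = {iota, phi, omega}, h = tanh.
lstm_cell = make_cell(
    g_int=lambda g, y_in, s_prev: y_in * g["iota"] + s_prev * g["phi"],
    g_out=lambda g, y_in, s: math.tanh(s) * g["omega"],
)


def demo():
    """Write 0.8 into the cell, hold it for three steps, then read it."""
    write = {"iota": 1.0, "phi": 0.0, "omega": 0.0}
    hold = {"iota": 0.0, "phi": 1.0, "omega": 0.0}
    read = {"iota": 0.0, "phi": 1.0, "omega": 1.0}
    s, y = lstm_cell(write, 0.8, 0.0)
    for other_input in (0.3, -0.5, 0.9):
        s, y = lstm_cell(hold, other_input, s)
    s, y = lstm_cell(read, 0.1, s)
    return s, y


if __name__ == "__main__":
    print(demo())
```

Other first order cells can be expressed with the same `make_cell` helper by swapping in a different `g_int` or `g_out`.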

Properties of cells. In developing the 1D LSTM cells, the main idea is to save exactly one piece of information over a long time series and to propagate the gradient back over this long time, so that the system can learn precise storage of this piece of information. For instance, a given input $y_{c_{in}}$ (which represents the information) at time $t_{in}$ should be stored in the cell state $s_c$ until the information is required at time $t_{out}$. To be able to prove the following properties, we will assume the truncated gradient defined in Definition 1. Nevertheless, we will use the full gradient in our experiments, because it turned out that it works much better. The next two properties of a cell ensure the ability to work as such a memory. The first property should ensure that an input $y_{c_{in}}$ at time $t_{in}$ can be memorized (the cell input is open) in the internal activation $s_c$ until $t_{out}$ (the cell memorizes) and has a negligible influence on the internal activation for $t > t_{out}$ (the cell forgets). In addition, the cell is able to prevent the influence of other inputs at time steps $t \neq t_{in}$ (the cell input is closed).

Definition 4 (Not vanishing gradient (NVG)) A cell $c$ allows an NVG $:\Leftrightarrow$ For arbitrary $t_{in}, t_{out} \in \mathbb{N}$, $t_{in} \le t_{out}$, $\forall \delta > 0$ there exist gate activations $y_\Gamma(t)$ such that for any $t_1, t_2 \in \mathbb{N}$

$$\frac{\partial s_c(t_2)}{\partial y_{c_{in}}(t_1)} \in \begin{cases} [1 - \delta, 1] & \text{for } t_1 = t_{in} \text{ and } t_{in} \le t_2 \le t_{out} \\ [0, \delta] & \text{otherwise} \end{cases} \tag{5}$$

holds.

The next definition guarantees that at any time $t \in \mathbb{N}$ the gate activations can (the cell output is open) or cannot (the cell output is closed) distribute the piece of information saved in $s_c$ to the network. This is an important property because the piece of information can be memorized in the cell without presenting it to the network. Note that the decision depends only on the gate activations at time $t$ and there are no constraints on previous gate activations. In Definition 2 we require $y_c(t) \in [-1, 1]$ whereas $s_c(t) \in \mathbb{R}$. So we cannot have arbitrarily small intervals of the derivative as in (5), but we can ensure two distinct intervals for open and closed cell output. When we take Definitions 4 and 5 together, a cell is able to save an input over a long time series, can decide at each time step whether or not it is presented to the network, and can forget the saved input.

Definition 5 (Controllable output dependency (COD)) A cell $c$ of order $k$ allows a COD $:\Leftrightarrow$ There exist $\delta_1, \delta_2 \in (0, 1)$, $\delta_2 < \delta_1$, so that for any time $t \in \mathbb{N}$ there exists a gate vector $y_\Gamma(t)$ leading to an open output dependency

$$\frac{\partial y_c(t)}{\partial s_c(t)} \in [\delta_1, 1] \tag{6}$$

and there exists another gate vector $y_\Gamma(t)$ leading to a closed output dependency

$$\frac{\partial y_c(t)}{\partial s_c(t)} \in [0, \delta_2]. \tag{7}$$

The third property is a kind of stability criterion. An unwanted case is that a small change caused by any noisy signal at time step $t_{in}$ has a growing influence at later time steps. This is equivalent to an exploding gradient over time. Controlling the gradient of the whole system and preventing it from exploding is a hard problem. But we can at least avoid the exploding gradient in one cell. This should be prohibited for any gate activations.

Definition 6 (Not exploding gradient (NEG)) A cell $c$ has an NEG $:\Leftrightarrow$ For any time steps $t_{in}, t \in \mathbb{N}$, $t_{in} < t$, and any gate activations $y_\Gamma(t)$ the truncated gradient is bounded by

$$\frac{\partial s_c(t)}{\partial s_c(t_{in})} \in [0, 1].$$

We think that a cell fulfilling these three properties can work as a stable memory. To be able to prove these properties for the LSTM cell we have to consider the gate activations. In general, the activation function of the gates does not have to be the logistic activation function $f_{\log}$, whereas for this paper we set $\forall \gamma \in \Gamma: f_\gamma := f_{\log}$. So the activation of a gate can never be exactly 0 or 1, because the input activation $net_\gamma(t)$ to the gate activation function is finite. But a gate can have an activation $y_\gamma(t) \in [1 - \varepsilon, 1)$ if it is opened or $y_\gamma(t) \in (0, \varepsilon]$ if it is closed, because for a realistically large input activation $net_\gamma(t) \ge 7$ (low input activation $net_\gamma(t) < -7$) we get an activation within the interval $y_\gamma(t) \in [1 - \varepsilon, 1)$ ($y_\gamma(t) \in (0, \varepsilon]$) with $\varepsilon < \frac{1}{1000}$. Working with these activation intervals, we can check whether or not the LSTM cell has these properties.

Theorem 7 (Properties of the LSTM cell) The 1D LSTM cell allows NVG and has an NEG, but does not allow COD.

Proof see A.1 in appendix.

4. Expanding to More Dimensions

In A. Graves and Schmidhuber (2007) the 1D LSTM cell is extended to an arbitrary number of dimensions; this is solved by using one FG for each dimension. In many publications, MDRNNs using the MD LSTM cell outperform state-of-the-art recognition systems (for example see Graves and Schmidhuber, 2008). But by expanding the cell to the MD case, the absolute value of the internal state $|s_c|$ can grow faster than linearly over time. When $|s_c| \to \infty$ and there are peephole connections (for peephole connection details see F. A. Gers and Schmidhuber, 2002), the cells have an output activation of $y_c \in \{-1, 0, 1\}$: The internal state multiplied by the peephole weight overlays the other activation-weight products, and this leads to an activation of the OG $y_\omega \in \{0, 1\}$ and a squashed internal state $h_c(s_c) \in \{-1, 1\}$. So the output of the cell is $y_\omega h_c(s_c) = y_c \in \{-1, 0, 1\}$. But even without peephole connections the internal state can grow, which leads to $h_c(s_c) \in \{-1, 1\}$, and the cell works like a conventional unit with a logistic activation function, $y_c(t) \approx \pm y_\omega(t)$.

Our goal is to transfer Definitions 4, 5 and 6 from Section 3 to the MD case, and we will see that the MD LSTM cell has an exploding gradient. In the next sections we will provide alternative cell types that fulfill two or all of these definitions. In the 1D case it is clear that there is just one way to get from date $t_1$ to date $t_2$, when $t_1 < t_2$, by incrementing $t_1$ until $t_2$ is reached. For the MD case the number of paths depends on the number of dimensions and the distance between these two dates. An MD path is defined as follows.

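Definition 8 (MD path) Let $p, q \in \mathbb{N}^D$ be two dates. A $p$-$q$-path $\pi$ of length $k \ge 0$ is a sequence $\pi := \{p = p_0, p_1, \dots, p_k = q\}$ with $\forall i \in \{1, \dots, k\}\ \exists! d \in \{1, \dots, D\}: (p_i)_d^- = p_{i-1}$. Further, let $\pi_i := p_i$. We can define the distance vector

$$\overrightarrow{pq} := \begin{pmatrix} \overrightarrow{pq}_1 \\ \vdots \\ \overrightarrow{pq}_D \end{pmatrix} = \begin{pmatrix} q_1 - p_1 \\ \vdots \\ q_D - p_D \end{pmatrix}$$

between the dates $p$ and $q$. When $\overrightarrow{pq}$ has at least one negative component, there exists no $p$-$q$-path. Otherwise there exist exactly

$$\#\{\overrightarrow{pq}\} := \binom{\sum_{i=1}^{D} \overrightarrow{pq}_i}{\overrightarrow{pq}_1, \dots, \overrightarrow{pq}_D} = \frac{\left(\sum_{i=1}^{D} \overrightarrow{pq}_i\right)!}{\prod_{i=1}^{D} \overrightarrow{pq}_i!}$$

$p$-$q$-paths (compare with the multinomial coefficient). We write $p < q$ when $\#\{\overrightarrow{pq}\} \ge 1$ and $p \le q$ when $p = q \lor p < q$.

Now we can extend the definitions of the 1D case to the MD case, where we concentrate on MD cells of order 1.

Definition 9 (MD cell) An MD cell, $c$, of order 1 and dimension $D$ consists of the same parts as a 1D cell of order 1. The cell update in date $p \in \mathbb{N}^D$ is performed in three subsequent phases:

1. Following the classical update scheme of neurons (see Section 2), all units in $\Gamma \cup \{c_{in}\}$ synchronously calculate their activations, which will be denoted by $y_\Gamma^p = (y_\gamma^p)_{\gamma \in \Gamma}$. Furthermore, we call $y_{c_{in}}^p$ the input activation of the cell.

2. Then, the cell computes its so-called internal state $s_c^p := g_{int}(y_\Gamma^p, y_{c_{in}}^p, s_c^{p_1^-}, \dots, s_c^{p_D^-})$.

3. Finally, the cell computes its so-called output activation $y_c^p := g_{out}(y_\Gamma^p, y_{c_{in}}^p, s_c^p, s_c^{p_1^-}, \dots, s_c^{p_D^-})$.

Using this, we can reintroduce the LSTM cell as well as Definitions 4, 5 and 6 for the MD case:

Definition 10 (MD LSTM cell) An MD LSTM cell is a cell of dimension $D$ and order 1 where $h_c = \tanh$ and $\Gamma = \{\iota, (\phi, 1), \dots, (\phi, D), \omega\}$,

$$s_c^p = g_{int}(y_\Gamma^p, y_{c_{in}}^p, s_c^{p_1^-}, \dots, s_c^{p_D^-}) = y_\iota^p\, y_{c_{in}}^p + \sum_{d=1}^{D} s_c^{p_d^-}\, y_{\phi,d}^p,$$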

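$$y_c^p = g_{out}(y_\Gamma^p, s_c^p) = h_c(s_c^p)\, y_\omega^p.$$

Definition 11 (MD Not vanishing gradient (NVG)) An MD cell $c$ allows an NVG $:\Leftrightarrow$ For arbitrary $p_{in}, p_{out} \in \mathbb{N}^D$, $p_{in} \le p_{out}$, $\forall \delta > 0$ there exist gate activations $y_\Gamma^p$, $p \in \mathbb{N}^D$, such that for any $p_1, p_2 \in \mathbb{N}^D$

$$\frac{\partial s_c^{p_2}}{\partial y_{c_{in}}^{p_1}} \in \begin{cases} [1 - \delta, 1] & \text{for } p_1 = p_{in} \text{ and } p_{in} \le p_2 \le p_{out} \\ [0, \delta] & \text{otherwise} \end{cases} \tag{8}$$

holds.

Definition 12 (MD Controllable output dependency (COD)) An MD cell $c$ allows a COD $:\Leftrightarrow$ There exist $\delta_1, \delta_2 \in (0, 1)$, $\delta_2 < \delta_1$, so that for any date $p \in \mathbb{N}^D$ there exists a gate vector $y_\Gamma^p$ leading to an open output dependency

$$\frac{\partial y_c^p}{\partial s_c^p} \in [\delta_1, 1] \tag{9}$$

and there exists another gate vector $y_\Gamma^p$ leading to a closed output dependency

$$\frac{\partial y_c^p}{\partial s_c^p} \in [0, \delta_2]. \tag{10}$$

Definition 13 (MD Not exploding gradient (NEG)) An MD cell $c$ has an NEG $:\Leftrightarrow$ For any dates $p_{in}, p \in \mathbb{N}^D$, $p_{in} < p$, and any gate activations $y_\Gamma^p$ the truncated gradient is bounded by

$$\frac{\partial s_c^p}{\partial s_c^{p_{in}}} \in [0, 1]. \tag{11}$$

We can now consider these definitions for the MD LSTM cell.

Theorem 14 (NVG of MD LSTM cells) An MD LSTM cell allows an NVG.

Proof see A.2 in appendix.

For arbitrary activations of the FGs the partial derivative $\frac{\partial s_c^p}{\partial s_c^{p_{in}}}$ can grow over time:

Theorem 15 (NEG of MD LSTM cells) An MD LSTM cell can have an exploding gradient when $D \ge 2$.

Proof see A.3 in appendix.

The MD LSTM cell does not allow the COD, because the 1D case is a special case of the MD case. Our idea for the next section is to change the MD LSTM layout so that it has an NEG.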
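To make the growth behind Theorem 15 concrete, the following sketch (not the proof from the appendix) counts $p$-$q$-paths with the multinomial formula of Definition 8 and accumulates the truncated state gradient of the MD LSTM cell over a grid. It assumes, analogously to the 1D derivation, that each step back in dimension $d$ contributes a factor $y_{\phi,d}$, and it uses a constant forget activation per dimension; the function names and the zero-based grid indexing are choices of this example.

```python
import math
from itertools import product


def count_paths(p, q):
    """Number of p-q-paths: multinomial coefficient of the distance vector."""
    dist = [qi - pi for pi, qi in zip(p, q)]
    if any(d < 0 for d in dist):
        return 0
    n = math.factorial(sum(dist))
    for d in dist:
        n //= math.factorial(d)
    return n


def md_lstm_state_gradient(shape, y_phi):
    """Truncated ds^p / ds^origin for every date p on a grid of given shape.

    y_phi[d] is the (constant) forget gate activation of dimension d.
    The gradient at p is the sum over all predecessors p_d^- of
    y_phi[d] * gradient(p_d^-), i.e. a sum over all paths from the origin.
    """
    origin = (0,) * len(shape)
    grad = {}
    # itertools.product yields dates in lexicographic order, so every
    # predecessor of p has been processed before p itself.
    for p in product(*(range(n) for n in shape)):
        if p == origin:
            grad[p] = 1.0
            continue
        total = 0.0
        for d in range(len(shape)):
            if p[d] > 0:
                prev = p[:d] + (p[d] - 1,) + p[d + 1:]
                total += y_phi[d] * grad[prev]
        grad[p] = total
    return grad


if __name__ == "__main__":
    grad_2d = md_lstm_state_gradient((8, 8), (1.0, 1.0))
    print(count_paths((0, 0), (7, 7)), grad_2d[(7, 7)])
    grad_1d = md_lstm_state_gradient((8,), (1.0,))
    print(grad_1d[(7,)])
```

With fully open forget gates the 2D gradient equals the number of paths and therefore grows combinatorially with the distance, whereas in one dimension it stays at 1.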

5. Reducing the MD LSTM Cell to One Dimension

In the last section, we showed that the MD LSTM cell can have an exploding gradient. We tried different ways to solve this problem. For example, we divided the activation of the FG by the number of dimensions. Then the gradient cannot explode over time, but it vanishes rapidly along some paths. Another approach was to give the cells the opportunity to learn to stabilize themselves when the internal state starts diverging. Therefore we added an additional peephole connection between the squared value of the previous internal states $(s_c^{p_d^-})^2$ and the FGs, so that the cell is able to learn that it has to close the FG for large internal states. This also does not make a significant difference. Forcing the cell to learn to stabilize itself by adding an error $\mathrm{Loss}_{state} = \varepsilon \|s_c\|_p$ with $p = \{1, 2, 3, 4\}$ and different learning rates $\varepsilon$ does not work either. So we tried to change the layout of the MD LSTM cell.

5.1 MD LSTM Stable Cell

In Section 3 we saw that 1D LSTM cells work well and the gradient does not explode, but in the MD case it does. Our idea is to combine the previous states $s_c^{p_d^-}$ at date $p$ into one previous state $s_c^{p^-}$ and take the 1D form of the LSTM cell. For this reason we call this cell the LSTM Stable cell. Therefore, a function $s_c^{p^-} = f(s_c^{p_1^-}, \dots, s_c^{p_D^-})$ is needed, so that the following two benefits of the 1D LSTM cell remain:

1. The MD LSTM Stable cell has an NEG.
2. The MD LSTM Stable cell allows NVG.

The convex combination

$$s_c^{p^-} = f(s_c^{p_1^-}, \dots, s_c^{p_D^-}) = \sum_{d=1}^{D} \lambda_d^p\, s_c^{p_d^-}, \qquad \forall d = 1, \dots, D: \lambda_d^p \ge 0, \quad \sum_{d=1}^{D} \lambda_d^p = 1$$

of all states satisfies both points (see Theorems 17 and 18). To calculate these $D$ coefficients we want to use the activations of $D$ gates, and we call them lambda gates (LG or $\lambda$).

Definition 16 (MD LSTM Stable cell) An MD LSTM Stable cell is a cell of dimension $D$ and order 1 where $h_c = \tanh$ and $\Gamma = \{\iota, (\lambda, 1), \dots, (\lambda, D), \phi, \omega\}$,

$$s_c^{p^-} = g_{conv}(y_\Gamma^p, s_c^{p_1^-}, \dots, s_c^{p_D^-}) = \sum_{d=1}^{D} s_c^{p_d^-}\, \frac{y_{\lambda,d}^p}{\sum_{d'=1}^{D} y_{\lambda,d'}^p},$$

$$s_p = g_{int}(y^{\Gamma}_p, y^{in}_p, s_{p^-}) = y^{\iota}_p\, y^{in}_p + s_{p^-}\, y^{\phi}_p$$

$$y_p = g_{out}(y^{\Gamma}_p, s_p) = y^{\omega}_p\, h(s_p)$$

Using these equations we can test the cell for its properties. The MD LSTM Stable cell does not have the COD, because the 1D LSTM cell also does not have this property. For the other properties we get:

Theorem 17 (LTD of MD LSTM Stable cells) An MD LSTM Stable cell allows NVG.

Proof See A.4 in the appendix.

Theorem 18 (NEG of MD LSTM Stable cells) An MD LSTM Stable cell has an NEG.

Proof See A.5 in the appendix.

Reducing the number of gates by one. When $D \ge 2$ an MD LSTM Stable cell has one more gate than a classical MD LSTM cell; for $D = 1$ both cells are equivalent. But it is possible to reduce the number of LGs by one. One solution is to choose one dimension $d' \in \{1, \dots, D\}$ which does not get an LG. Its activation is calculated by

$$y^{\lambda,d'}_p = \prod_{d \in \{1, \dots, D\} \setminus \{d'\}} \left(1 - y^{\lambda,d}_p\right).$$

In the special case of $D = 2$ we can choose $d' = 2$ and we get

$$\sum_{d'=1}^{2} y^{\lambda,d'}_p = y^{\lambda,1}_p + \left(1 - y^{\lambda,1}_p\right) = 1$$

and the update equation of the internal state can be simplified to

$$s_p = g_{int}(y^{\iota}_p, y^{\lambda,1}_p, y^{\phi}_p, s_{p_1^-}, s_{p_2^-}) = y^{\iota}_p\, y^{in}_p + y^{\phi}_p \left( y^{\lambda,1}_p\, s_{p_1^-} + \left(1 - y^{\lambda,1}_p\right) s_{p_2^-} \right).$$

6. Bounding the Internal State

In the last sections we discussed the growing of the EC over time and we found a solution to have a NGEC for higher dimensions. Nevertheless it is possible that the internal state grows linearly over time. When we take a look at Definition 10, we see that the partial derivative for $p = p_{out}$ depends on $h'(s_p)$. So, having the inequality $\frac{\partial y_p}{\partial s_p} \le h'(s_p)$ with $h'(s_p) \to 0$ for $s_p \to \pm\infty$, the cell allows NVG as defined in Definition 11, but actually we have $\frac{\partial y_{p_{out}}}{\partial y^{in}_{p_{in}}} \to 0$ for $s_{p_{out}} \to \pm\infty$ for arbitrary gate activations. Again, ideas like state decay, additional peephole connections or additional loss functions as mentioned in Section 4 either do not work or destroy the NVG of the LSTM and LSTM Stable cell. So, our solution is to change the architecture of the MD LSTM Stable cell, so

that it has an NEG and allows NVG and COD. The key idea is to bound the internal state, so that for all inputs $|y^{in}_p| \le 1$, $\forall p \in \mathbb{N}^D$, the internal state is bounded by $|s_p| \le 1$. Note that this is comparable with the well-known Bounded-Input-Bounded-Output stability (BIBO stability). To create an MD cell that has an NEG, allows NVG and has a bounded internal state, we take the MD LSTM Stable cell proposed in the last section and change its layout. We calculate the activation of the IG as a function of the FG, so that we achieve $|s_p| \le 1$ by choosing $y^{\iota}_p := 1 - y^{\phi}_p$. So the activation of the FG controls how much leaks from the previous states. The activation of the FG can also be interpreted as a switch that decides whether the previous internal activation, the new activation or a convex combination of both should be stored in the cell. So $s_p$ can be seen as a time-dependent exponential moving average of $y^{in}$.

Definition 19 (MD Leaky cell) An MD Leaky cell is a cell of dimension D and order 1 where $h = \tanh$ and $\Gamma = \{(\lambda, 1), \dots, (\lambda, D), \phi, \omega\}$

$$s_{p^-} = g_{conv}(y^{\Gamma}_p, s_{p_1^-}, \dots, s_{p_D^-}) = \sum_{d=1}^{D} s_{p_d^-} \frac{y^{\lambda,d}_p}{\sum_{d'=1}^{D} y^{\lambda,d'}_p}$$

$$s_p = g_{int}(y^{\Gamma}_p, y^{in}_p, s_{p^-}) = \left(1 - y^{\phi}_p\right) y^{in}_p + s_{p^-}\, y^{\phi}_p$$

$$y_p = g_{out}(y^{\Gamma}_p, s_p) = y^{\omega}_p\, h(s_p)$$

Now we can prove that the resulting cell has all benefits.

Theorem 20 The MD Leaky cell has an NEG and allows NVG and COD.

Proof See A.6 in the appendix.

The MD Leaky cell can have one gate less than the MD LSTM cell and the MD LSTM Stable cell, and because of this the update path requires fewer computations.

7. General Derivation of Leaky Cells

So far we proposed cells for the MD case which are able to provide long term memory. But especially in MDRNNs with more than one MD layer it is hard to measure if and how much long term dependencies are used, and even whether they are useful. Another way to interpret the cells is to consider them as a kind of MD feature extractor, like feature maps in Convolutional Neural Networks (Bengio and LeCun, 1995). Then the aim is to construct an MD cell which is able to generate useful features.
In a hierarchical Neural Network like in Bengio and LeCun (1995) and Graves and Schmidhuber (2008), the number of features increases over the hierarchies while the feature resolution simultaneously decreases. Features in a layer with low resolution can be seen as low frequency features in comparison to features in a layer with high resolution. So it would be useful to construct a cell as a feature extractor which produces a low frequency output in comparison to its input. In appendix B we take a closer look at the theory of linear shift invariant (LSI) systems and their frequency analysis, and analyse a first order LSI-system regarding its freely selectable parameters using the F- and Z-transform. There, we derive the MD LeakyLP cell (see Definition 21) and 5 additional first order MD cells, which we do not test in Section 8.
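The low-pass behaviour motivated above can be illustrated with the update rule of the MD Leaky cell from Definition 19. The sketch below implements one cell step with plain Python floats. The function names `leaky_cell_step` and `run_1d`, and the use of constant gate activations, are assumptions made for illustration; in a trained network the gates depend on the input. With D = 1 and a fixed forget gate the internal state is an exponential moving average, so a constant input passes through while a rapidly alternating input is damped.

```python
import math


def leaky_cell_step(y_in, prev_states, lambdas, phi, omega):
    """One update of an MD Leaky cell (after Definition 19).

    prev_states: previous internal states, one per dimension
    lambdas:     lambda-gate activations, one per dimension, each > 0
    phi, omega:  forget-gate and output-gate activations
    """
    # Convex combination of the D previous states.
    total = sum(lambdas)
    s_prev = sum(s * lam for s, lam in zip(prev_states, lambdas)) / total
    # The input gate is tied to the forget gate: y_iota = 1 - y_phi.
    s = (1.0 - phi) * y_in + phi * s_prev
    y = omega * math.tanh(s)
    return s, y


def run_1d(inputs, phi):
    """Feed a 1D sequence through the cell with constant gates (illustrative
    assumption) and return the sequence of internal states."""
    s, states = 0.0, []
    for x in inputs:
        s, _ = leaky_cell_step(x, [s], [1.0], phi, 1.0)
        states.append(s)
    return states


if __name__ == "__main__":
    constant = run_1d([1.0] * 200, phi=0.8)
    alternating = run_1d([(-1.0) ** t for t in range(200)], phi=0.8)
    print("constant input, final |s|:   ", abs(constant[-1]))
    print("alternating input, final |s|:", abs(alternating[-1]))
```

For a constant forget gate φ the steady-state amplitude for the alternating input is (1 − φ)/(1 + φ), roughly 0.11 for φ = 0.8, while the constant input converges to 1. In both cases the state never leaves [−1, 1].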

Definition 21 (MD LeakyLP cell) An MD LeakyLP cell is a cell of dimension D and order 1 where $h = \tanh$ and $\Gamma = \{(\lambda, 1), \dots, (\lambda, D), \phi, \omega_0, \omega_1\}$

$$s_{p^-} = g_{conv}(y^{\Gamma}_p, s_{p_1^-}, \dots, s_{p_D^-}) = \sum_{d=1}^{D} s_{p_d^-} \frac{y^{\lambda,d}_p}{\sum_{d'=1}^{D} y^{\lambda,d'}_p}$$

$$s_p = g_{int}(y^{\Gamma}_p, y^{in}_p, s_{p^-}) = \left(1 - y^{\phi}_p\right) y^{in}_p + s_{p^-}\, y^{\phi}_p$$

$$y_p = g_{out}(y^{\Gamma}_p, s_p, s_{p^-}) = h\left(s_p\, y^{\omega_0}_p + s_{p^-}\, y^{\omega_1}_p\right)$$

Setting the second OG $y^{\omega_1}_p$ to zero, the LeakyLP cell corresponds to the Leaky cell, hence it fulfills all three properties, but it has one more gate, which is as many gates as the LSTM cell.

8. Experiments

RNNs with 1D LSTM cells are well studied. In some experiments the activations of the gates and the internal state are observed, and one can see that the cell can really learn when to forget information and when the internal state should be accessible for the network (see F. A. Gers and Schmidhuber, 2002). However, we did not find experiments like these for the MD case and we do not want to transfer these experiments into the MD case. Instead we compare the different cell types with each other in two scenarios where MD RNNs with LSTM cells perform very well. In both benchmarks the task is to transcribe a handwritten text on an image, so we have a 2D RNN. We compare the cells on the IFN/ENIT (Pechwitz et al., 2002) and the Rimes database (Augustin et al., 2006). Both tasks are solved with the MD RNN layout described in Graves and Schmidhuber (2008) and shown in Figure 4. All networks are trained with Backpropagation through time (BPTT). To compare the different cell types in RNNs with each other, we take 10 RNNs with different weight initializations of each cell type and calculate the minimum, the maximum and the median of the best label error rate (LER) on a validation set of these 10 RNNs. In all tables we present these three LERs to compare the cell types. We think it is more important to have stable cells in the lower MD layers, for two reasons: First, when we have just a few cells in a layer, the saturation of one cell has a greater effect on the performance of the network.
Second, in lower layers there are longer time series, so an unstable cell in such a layer has time to saturate. So our first experiment compares the recognition results when we substitute the LSTM cells in the lowest layer (which is 2D layer 1 in Figure 4) by the newly developed cells. In the second experiment we compare the LSTM cell and the LeakyLP cell also in the higher MD layers (2D layers 2 and 3 in Figure 4), to evaluate whether the LeakyLP cells also work better in long time series. In Bengio (2012, 3.1.1) it is mentioned that an important hyperparameter for training is the learning rate, so another experiment is to train all networks with stochastic gradient descent with different learning rates $\delta \in \{1 \cdot 10^{-3}, 5 \cdot 10^{-4}, 2 \cdot 10^{-4}, 1 \cdot 10^{-4}\}$ and compare the best LER for each fixed learning rate.
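The evaluation protocol described above (several networks per cell type, the best validation LER of each network, then minimum, maximum and median over the networks) can be sketched as follows. The helper names `best_ler` and `summarize_ler` and the example curves are hypothetical; the numbers are made up and are not results from the paper.

```python
from statistics import median


def best_ler(ler_per_epoch):
    """Best (lowest) validation LER reached by one network over its epochs."""
    return min(ler_per_epoch)


def summarize_ler(best_ler_per_run):
    """Return (min, max, median) of the best validation LER of each run."""
    if not best_ler_per_run:
        raise ValueError("need at least one run")
    return (min(best_ler_per_run),
            max(best_ler_per_run),
            median(best_ler_per_run))


if __name__ == "__main__":
    # Made-up validation curves of three differently initialized networks.
    runs = [
        [0.30, 0.12, 0.10, 0.11],
        [0.28, 0.15, 0.14, 0.16],
        [0.35, 0.13, 0.09, 0.09],
    ]
    lo, hi, med = summarize_ler([best_ler(r) for r in runs])
    print(f"min={lo:.2%} max={hi:.2%} median={med:.2%}")
```

Reporting the three values together shows both the best achievable result and how strongly it depends on the weight initialization.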

[Figure 4: network diagram, from output to input: output layer, 2D layer 3, 0D layer 3, subsample 3, 2D layer 2, 0D layer 2, subsample 2, 2D layer 1, subsample 1, input layer]

Figure 4: Architecture of the hierarchical MDRNN used for the experiments: It is equivalent to Graves and Schmidhuber (2008, Figure 2). A 2D layer contains $2^2$ distinct layers, one layer for each combination of scanning directions (left/right and up/down). To reduce the number of weights between two 2D layers, a 0D layer is inserted, which contains units with tanh as activation function. They have dimension 0 because they have no recurrent connections. These layers can be seen as feed-forward or convolutional layers. Each 2D layer or its allocated 0D layer reduces its size in the x and y dimension using a two-dimensional subsampling. Simultaneously the number of feature maps (z-dimension) increases, so that there is no bottleneck between input and output layer.

8.1 The IFN/ENIT Database

This database contains handwritten names of towns and villages of Tunisia. The set is divided into 7 sets (a-f, s), where 5 (a-e) are available for training and validation (for details see Pechwitz et al., 2002). With all information we got from A. Graves, we were able to get results comparable to Graves and Schmidhuber (2008). Therefore we divide the sets a-e into 30000 training samples and 2493 validation samples. All networks are trained for 100 epochs with a fixed learning rate $\delta = 1 \cdot 10^{-4}$. The LER is calculated on the validation set.
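This section does not restate how the LER is computed. A common definition in handwriting recognition, assumed here and not confirmed by the text above, is the edit distance between the recognized and the reference label sequence, summed over all samples and divided by the total number of reference labels. The function names below are hypothetical.

```python
def edit_distance(predicted, target):
    """Levenshtein distance between two label sequences
    (insertions, deletions and substitutions all cost 1)."""
    previous = list(range(len(target) + 1))
    for i, p in enumerate(predicted, start=1):
        current = [i]
        for j, t in enumerate(target, start=1):
            current.append(min(
                previous[j] + 1,             # delete p
                current[j - 1] + 1,          # insert t
                previous[j - 1] + (p != t),  # substitute or match
            ))
        previous = current
    return previous[-1]


def label_error_rate(predictions, targets):
    """Assumed LER definition: total edit distance divided by the total
    number of reference labels."""
    errors = sum(edit_distance(p, t) for p, t in zip(predictions, targets))
    labels = sum(len(t) for t in targets)
    return errors / labels


if __name__ == "__main__":
    print(label_error_rate(["abd", "x"], ["abc", "xy"]))
```

Normalizing by the total reference length rather than averaging per-sample rates keeps long words from being under-weighted; whether the paper uses exactly this normalization is an assumption.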

Celltype    Label-Error-Rate in Percent
            min       max       median
LSTM        8,58%     14,73%    10,58%
Stable      8,78%     11,75%    9,55%
Leaky       8,87%     10,47%    9,0%
LeakyLP     8,24%     9,40%     8,93%

Table 1: Different cell types in the lowest MD layer

Celltype in 2D layer              Label-Error-Rate in Percent
1          2          3           min       max       median
LSTM       LSTM       LSTM        8,58%     14,73%    10,58%
LeakyLP    LSTM       LSTM        8,24%     9,40%     8,93%
LeakyLP    LeakyLP    LSTM        8,35%     11,27%    8,91%
LeakyLP    LeakyLP    LeakyLP     8,92%     11,69%    9,74%

Table 2: Different cells in other layers

8.1.1 DIFFERENT CELLS IN THE LOWEST MD LAYER

In our first experiment we substitute the LSTM cell in the lowest MD layer. We take some of the cells described in this paper. The results are shown in Table 1. The first row is the same RNN layout as used in Graves and Schmidhuber (2008). We can see that the LeakyLP cell performs best. Nevertheless the worst RNN with LeakyLP cells in the lowest MD layer performs worse than the best RNN with LSTM cells. So we cannot say that LeakyLP is always better. But it can be observed that the variance of the RNN performance is very high with LSTM cells in the lowest MD layer. Our interpretation is that LSTM cells in the lowest layer perform comparably to the LeakyLP cells when they do not saturate. Note that the Leaky cell has one gate less, so it is faster and has fewer trainable weights.

8.1.2 DIFFERENT CELLS IN OTHER MD LAYERS

Now we want to compare the best newly developed cell (the LeakyLP cell) with the LSTM cell in the other MD layers. So we also substitute the LSTM cell in the upper MD layers. We enumerate the 2D layers as shown in Figure 4. In Table 2 we can see that substituting the LSTM cells only in the lowest layer or in the two lowest layers performs slightly better. The best results can be achieved when we use LeakyLP cells in 2D layer 1 and LSTM cells in 2D layer 3. Using LSTM in the middle layer seems to work slightly better than using the LeakyLP cells instead.
This fits our intuition mentioned before that the LSTM cells perform better when the time series are not too long and when there are enough cells in one layer which do not saturate.

8.1.3 PERFORMANCE OF CELLS REGARDING LEARNING-RATE

When we take a look at the update equations and the proofs of the NEG, it can be assumed that the gradient going through the cells is lower for LeakyLP cells than for LSTM cells. So we think the learning rate has to be larger for LeakyLP cells. In Table 3 we compare the networks with

either only LSTM or only LeakyLP cells. There we can see that the learning rate has to be much higher for the LeakyLP cells. In addition, the RNNs with LeakyLP cells are more robust to the choice of the learning rate.

Celltype    BP-delta          Label-Error-Rate in Percent
                              min       max       median
LSTM        1·10^-4           8,58%     14,73%    10,58%
LSTM        2·10^-4           9,5%      16,86%    0,5%
LSTM        5·10^-4           9,03%     2,77%     11,44%
LSTM        1·10^-3           0,2%      30,20%    11,44%
LeakyLP     1·10^-4           8,92%     11,69%    9,74%
LeakyLP     2·10^-4           8,38%     9,09%     8,81%
LeakyLP     5·10^-4           8,25%     8,95%     8,78%
LeakyLP     1·10^-3           8,29%     9,20%     8,88%
LeakyLP     2·10^-3           8,95%     2,8%      9,55%

Table 3: Performance of cells regarding learning-rate

8.2 The Rimes Database

One task of the Rimes database is handwritten word recognition (for more details see E. Grosicki and Geoffrois, 2008; Grosicki and El-Abed, 2011). It contains 59292 images of French single words. It is divided into distinct subsets: a training set of 4496 samples, a validation set of 7542 samples and a test set of 7464 samples. We train the MD RNNs by using the training set for training and calculate the LER over the validation set, so the network is trained on 4496 training samples each epoch. The network used in this section differs from the network used in Graves and Schmidhuber (2008) only in the subsampling rate between two layers. When there is a subsampling between layers, the factors are 3 × 2 instead of 4 × 3 or 4 × 2. The rest of the experiment is the same as described in Section 8.1.

8.2.1 DIFFERENT CELLS IN THE LOWEST MD LAYER

In Table 4 we can see that substituting the LSTM in the lowest layer by one of the three cells improves the performance of the network, even for the Leaky cell with one gate less.

Celltype    Label-Error-Rate in Percent
            min       max       median
LSTM        4,96%     7,63%     6,50%
Stable      4,45%     6,02%     5,11%
Leaky       4,77%     6,39%     5,85%
LeakyLP     4,63%     5,78%     5,30%

Table 4: Different cell types in the lowest MD layer

Celltype in 2D layer              Label-Error-Rate in Percent
1          2          3           min       max       median
LSTM       LSTM       LSTM        4,96%     7,63%     6,50%
LeakyLP    LSTM       LSTM        4,63%     5,78%     5,30%
LeakyLP    LeakyLP    LSTM        4,2%      5,57%     4,92%
LeakyLP    LeakyLP    LeakyLP     4,94%     6,8%      5,52%

Table 5: Different cells in other layers

Celltype                       BP-delta     Label-Error-Rate in Percent
                                            min       max       median
LSTM                           1·10^-4      4,96%     7,63%     6,50%
LSTM                           2·10^-4      4,4%      6,88%     5,6%
LSTM                           5·10^-4      5,05%     6,27%     5,47%
LeakyLP                        1·10^-4      4,94%     6,8%      5,52%
LeakyLP                        5·10^-4      2,68%     3,95%     3,57%
LeakyLP in 2D layer 1 & 2      2·10^-4      3,26%     4,04%     3,65%
LeakyLP in 2D layer 1 & 2      5·10^-4      2,08%     3,42%     2,87%

Table 6: Performance of cells regarding learning-rate

8.2.2 DIFFERENT CELLS IN OTHER MD LAYERS

We want to see the effect of substituting the LSTM cell by the LeakyLP cell in the upper MD layers. In Table 5 we can see that using LeakyLP cells in the two lowest layers performs very well. So we also use this setup to try different learning rates.

Performance of Cells Regarding Learning-Rate. Using different learning rates, we can see that the RNN with LeakyLP cells in the two lowest layers and LSTM cells in the top layer can significantly improve the performance (Table 6). Even the maximal LER of this RNN is better than that of the best network with LSTM cells in each layer.

9. Conclusion

In this paper we took a look at the one-dimensional LSTM cell and discussed the benefits of this cell. We found two properties that probably make these cells so powerful in the one-dimensional case. Expanding these properties to the multi-dimensional case, we saw that the LSTM does not fulfill one of these properties any more. We solved this problem by changing the architecture of the cell. In addition we presented a more general idea of how to create one-dimensional or multi-dimensional cells. We compared some newly developed cells with the LSTM cell on two data sets and could improve the performance using the new cell types.
Due to this we think that substituting the multi-dimensional LSTM cells by the multi-dimensional LeakyLP cell could improve the performance of many systems working with a multi-dimensional space.

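Appendix A. Proofs

A.1 Proof of 7

Proof Let $c$ be a 1D LSTM cell. To get the derivative $\frac{\partial s(t_2)}{\partial s(t_1)}$ according to the truncated gradient between two time steps $t_1, t_2 \in \mathbb{N}$ we have to take a look at $g_{int}$:

$$\frac{\partial s(t_2)}{\partial s(t_1)} = \frac{\partial g_{int}(y_{\Gamma}(t_2), y_{in}(t_2), s(t_2 - 1))}{\partial s(t_1)} = \underbrace{\frac{\partial \left( y_{in}(t_2)\, y_{\iota}(t_2) \right)}{\partial s(t_1)}}_{=0} + \frac{\partial s(t_2 - 1)}{\partial s(t_1)}\, y_{\phi}(t_2) + s(t_2 - 1) \underbrace{\frac{\partial y_{\phi}(t_2)}{\partial s(t_1)}}_{=0} \qquad (12)$$

$$= \frac{\partial s(t_2 - 1)}{\partial s(t_1)}\, y_{\phi}(t_2) = \prod_{t = t_1 + 1}^{t_2} y_{\phi}(t). \qquad (13)$$

In addition, $\forall t \in \mathbb{N}$ we have

$$\frac{\partial s(t)}{\partial y_{in}(t)} = y_{\iota}(t) \quad \text{and} \quad \frac{\partial y(t)}{\partial s(t)} = h'(s(t))\, y_{\omega}(t). \qquad (14)$$

We will prove the properties successively.

NEG: For the LSTM cell the FG $f_{\phi} = f_{log}$ ensures $y_{\phi}(t) \in (0, 1)$, so using these bounds in (13) with

$$\frac{\partial s(t)}{\partial s(t_{in})} = \prod_{t' = t_{in} + 1}^{t} y_{\phi}(t') \in (0, 1)$$

the LSTM cell has an NEG.

NVG: Therefore, we choose

$$y_{\iota}(t) \in \begin{cases} [1 - \varepsilon, 1) & \text{if } t = t_{in} \\ (0, \varepsilon] & \text{otherwise,} \end{cases} \qquad y_{\phi}(t) \in \begin{cases} [1 - \varepsilon, 1) & \text{if } t_{in} < t \le t_{out} \\ (0, \varepsilon] & \text{otherwise,} \end{cases}$$

with a later chosen $\varepsilon > 0$. Let $t_1, t_2 \in \mathbb{N}$, $t_1 \le t_2$ be two arbitrary dates for which we want to calculate the gradient $\frac{\partial s(t_2)}{\partial y_{in}(t_1)}$. First, we want to show that the LSTM cell allows NVG for $t_1 = t_{in}$ and $t_{in} \le t_2 \le t_{out}$: We have $y_{\iota}(t_1) \in [1 - \varepsilon, 1)$ and $\forall t = t_{in} + 1, \dots, t_{out}$: $y_{\phi}(t) \in [1 - \varepsilon, 1)$. Then, we can estimate the derivative from (12) and (14) by

$$\frac{\partial s(t_2)}{\partial y_{in}(t_1)} = \frac{\partial s(t_2)}{\partial s(t_1)} \frac{\partial s(t_1)}{\partial y_{in}(t_1)} = y_{\iota}(t_1) \prod_{t = t_1 + 1}^{t_2} y_{\phi}(t) \in \left[ (1 - \varepsilon) \prod_{t = t_1 + 1}^{t_2} (1 - \varepsilon),\; 1 \right) \subseteq \left[ (1 - \varepsilon)^{t_{out} - t_{in} + 1},\; 1 \right).$$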

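To fulfill the equation for NVG we choose $\varepsilon$ depending on $\delta$ such that

$$1 - \delta \le (1 - \varepsilon)^{t_{out} - t_{in} + 1} \iff \varepsilon \le 1 - (1 - \delta)^{\frac{1}{t_{out} - t_{in} + 1}}$$

holds. Second, we have to show that the derivative is in $[0, \delta]$ when $t_1 = t_{in}$ and $t_{in} \le t_2 \le t_{out}$ is not fulfilled. In the case of $t_1 \ne t_{in}$, when $\varepsilon \le \delta$ we can use the NEG, which leads to

$$\frac{\partial s(t_2)}{\partial y_{in}(t_1)} = \underbrace{\frac{\partial s(t_2)}{\partial s(t_1)}}_{\in [0, 1]} \underbrace{\frac{\partial s(t_1)}{\partial y_{in}(t_1)}}_{\in (0, \varepsilon]} \in [0, \varepsilon] \subseteq [0, \delta].$$

When $t_1 = t_{in}$ we have two cases: $t_2 < t_{in}$ or $t_2 > t_{out}$. For the case $t_2 < t_{in}$ the derivative is zero $\in [0, \delta]$, because the cell is causal. For $t_2 > t_{out}$ we can split the derivative at $t_{out}$ and get

$$\frac{\partial s(t_2)}{\partial y_{in}(t_1)} = \underbrace{y_{\iota}(t_1) \prod_{t = t_1 + 1}^{t_{out}} y_{\phi}(t)}_{\in (0, 1)} \prod_{t = t_{out} + 1}^{t_2} \underbrace{y_{\phi}(t)}_{\in (0, \varepsilon]} \in \left( 0, \varepsilon^{t_2 - t_{out}} \right] \subseteq [0, \varepsilon] \subseteq [0, \delta].$$

For $\varepsilon \le \min \left\{ \delta,\; 1 - (1 - \delta)^{\frac{1}{t_{out} - t_{in} + 1}} \right\}$ the LSTM cell allows NVG.

COD: To prove that the LSTM cell has no COD, we show that there are gate activations such that in Definition 5 we get $\delta_2 > \delta_1$. Therefore, we assume that all gate activations are arbitrary $y_{\gamma}(t) \in (0, 1)$, closed $y_{\gamma}(t) \in (0, \varepsilon]$ or opened $y_{\gamma}(t) \in [1 - \varepsilon, 1)$ with a later chosen $\varepsilon > 0$. We take a look at the right side of (14). For $s(t) = 0$ we get $h'(s(t)) = 1$. In Definition 5 we have to satisfy $\exists y_{\Gamma}(t): \frac{\partial y(t)}{\partial s(t)} \in [0, \delta_2]$, so we can choose the OG

$$y_{\omega}(t) \in (0, \varepsilon] \quad \text{with} \quad \varepsilon \le \delta_2. \qquad (15)$$

But then for $t' = 1, \dots, t - 1$ we can choose the IG and FG open with the same $\varepsilon$ so that $y_{\phi}(t'), y_{\iota}(t') \in [1 - \varepsilon, 1)$. When for all time steps $t' = 1, \dots, t$ there is a positive input $y_{in}(t') \in [c, 1)$, $c \in (0, 1) \subset \mathbb{R}$, and an internal state $s(t' - 1) < \frac{c (1 - \varepsilon)}{\varepsilon}$, the internal state is growing over time, because

$$s(t') = y_{in}(t')\, y_{\iota}(t') + s(t' - 1)\, y_{\phi}(t') \ge c (1 - \varepsilon) + s(t' - 1)(1 - \varepsilon) = s(t' - 1) + c (1 - \varepsilon) - \varepsilon\, s(t' - 1) > s(t' - 1) + c (1 - \varepsilon) - \varepsilon \frac{c (1 - \varepsilon)}{\varepsilon} = s(t' - 1).$$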

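For large $s(t) \ge \frac{c (1 - \varepsilon)}{\varepsilon}$ we can estimate

$$\tanh'(s(t)) = \frac{4}{\left( \exp(s(t)) + \exp(-s(t)) \right)^2} \le 4 \exp(-2 s(t)) \le 4 \exp\left( -\frac{2 c (1 - \varepsilon)}{\varepsilon} \right).$$

This yields in (14) the bound

$$\frac{\partial y(t)}{\partial s(t)} = h'(s(t))\, y_{\omega}(t) \qquad (16)$$

$$\le 4 \exp\left( -\frac{2 c (1 - \varepsilon)}{\varepsilon} \right), \qquad (17)$$

so in Definition 5 we get

$$\delta_1 \le 4 \exp\left( -\frac{2 c (1 - \varepsilon)}{\varepsilon} \right). \qquad (18)$$

But when we combine (15), (18) and the restriction in Definition 5, we have

$$\varepsilon \le \delta_2 < \delta_1 \le 4 \exp\left( -\frac{2 c (1 - \varepsilon)}{\varepsilon} \right),$$

but there exist $\varepsilon, c$ such that the inequality is not fulfilled, which is a contradiction. Summarized, the 1D LSTM cell allows an NVG and has an NEG, but does not allow COD.

A.2 Proof of 14

Proof Let $c$ be an MD LSTM cell of dimension D, $p, p_1, p_2, p_{in}, p_{out} \in \mathbb{N}^D$, $p_{in} \le p_{out}$ arbitrary dates and $h = \tanh$ the sigmoid function. Besides, $\varepsilon > 0$ is a later chosen value. In the first step we want to show that there are activations of the forget gates, so that

$$\frac{\partial s_p}{\partial s_{p_{in}}} \in \begin{cases} \left[ (1 - \varepsilon)^{|p - p_{in}|}, 1 \right] & \text{for } p_{in} \le p \le p_{out} \\ [0, D \varepsilon] & \text{otherwise} \end{cases} \qquad (19)$$

is fulfilled. The proof is done using induction over $k = |p - p_{in}|$ with $p \ge p_{in}$. The base $k = 0$ is clear. Let $k \ge 1$. We define $P := \{ d \in \{1, \dots, D\} \mid p_d^- \ge p_{in} \}$, the set of dimensions $d$ in which there are $p_{in}$-$p_d^-$-paths. Note that this set cannot be empty, because $p > p_{in}$ for $k \ge 1$. When we have a dimension $d \in P$ then $|p_d^- - p_{in}| = k - 1$ and we assume

$$\frac{\partial s_{p_d^-}}{\partial s_{p_{in}}} \in \left[ (1 - \varepsilon)^{|p_d^- - p_{in}|}, 1 \right] = \left[ (1 - \varepsilon)^{k - 1}, 1 \right]. \qquad (20)$$

Then we choose the activations of the FG to be

$$y^{\phi,d}_p \in \begin{cases} \left[ \frac{1 - \varepsilon}{|P|}, \frac{1}{|P|} \right] & \text{for } d \in P \text{ and } p_{in} < p \le p_{out} \\ [0, \varepsilon] & \text{otherwise.} \end{cases} \qquad (21)$$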

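Then we can estimate the derivative for $p_{in} \le p \le p_{out}$ using (20) and (21) to

$$\frac{\partial s_p}{\partial s_{p_{in}}} = \sum_{d \in P} \frac{\partial s_p}{\partial s_{p_d^-}} \frac{\partial s_{p_d^-}}{\partial s_{p_{in}}} = \sum_{d \in P} y^{\phi,d}_p \frac{\partial s_{p_d^-}}{\partial s_{p_{in}}} \in \left[ |P| (1 - \varepsilon)^{k - 1} \frac{1 - \varepsilon}{|P|},\; |P| \frac{1}{|P|} \right] = \left[ (1 - \varepsilon)^{|p - p_{in}|}, 1 \right], \qquad (22)$$

so (19) is fulfilled for $p_{in} \le p \le p_{out}$. If we have $p < p_{in}$ in (19), the derivative is 0, because we have a causal system. For $p > p_{out}$ in (19), we choose $\varepsilon \le \frac{1}{D} \le \frac{1}{|P|}$ in (21) to ensure $\forall p \in \mathbb{N}^D: \frac{\partial s_p}{\partial s_{p_{in}}} \le 1$ (see (22)) and we get

$$\frac{\partial s_p}{\partial s_{p_{in}}} = \sum_{d = 1, \dots, D} y^{\phi,d}_p \frac{\partial s_{p_d^-}}{\partial s_{p_{in}}} \in \left( 0,\; D \varepsilon \max_{d = 1, \dots, D} \frac{\partial s_{p_d^-}}{\partial s_{p_{in}}} \right] \subseteq (0, D \varepsilon], \qquad (23)$$

and (19) is fulfilled. In the second step let $p_2$ be the date for which we want to calculate the truncated gradient $\frac{\partial s_{p_2}}{\partial y^{in}_{p_1}}$. We choose the IG activation as

$$y^{\iota}_p \in \begin{cases} [1 - \varepsilon, 1) & \text{if } p = p_{in} \\ (0, \varepsilon] & \text{otherwise} \end{cases} \qquad (24)$$

and we get $\frac{\partial s_p}{\partial y^{in}_p} = y^{\iota}_p$. Using (22), (23) and (24), we can estimate the partial derivative by

$$\frac{\partial s_{p_2}}{\partial y^{in}_{p_1}} = \frac{\partial s_{p_2}}{\partial s_{p_1}} \frac{\partial s_{p_1}}{\partial y^{in}_{p_1}} \in \begin{cases} \left[ (1 - \varepsilon)(1 - \varepsilon)^{|p_2 - p_{in}|}, 1 \right] & \text{for } p_1 = p_{in} \text{ and } p_{in} \le p_2 \le p_{out} \\ [0, D \varepsilon] & \text{otherwise,} \end{cases}$$

and setting

$$\varepsilon := \min \left\{ \frac{\delta}{D},\; 1 - (1 - \delta)^{\frac{1}{|p_{in} - p_{out}| + 1}} \right\}$$

the conditions of Definition 11 are fulfilled.

A.3 Proof of 15

Proof Let $c$ be an MD cell of dimension D with the internal state $s_p$ and $p_{in}, p_k \in \mathbb{N}^D$, $p_{in} \le p_k$ two dates. Let $p_k$ be a date $k$ steps further in each dimension than a fixed date $p_{in}$. So the distance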

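between them is $|p_{in} - p_k| = Dk$. Let $\Pi$ be the set of all $p_{in}$-$p_k$-paths, then there exist $|\Pi| = \#\{p_{in} \to p_k\}$ paths (see Definition 8). We assume $y^{\phi,d}_p \in [\varepsilon, 1 - \varepsilon]$ with $\varepsilon \in (0, 0.5)$ and we can estimate the partial derivative, using the truncated gradient, with

$$\frac{\partial s_{p_k}}{\partial s_{p_{in}}} = \sum_{\pi \in \Pi} \prod_{i = 1}^{Dk} y^{\phi,d_i}_{\pi_i} \in \left[ \varepsilon^{Dk}\, \#\{p_{in} \to p_k\},\; (1 - \varepsilon)^{Dk}\, \#\{p_{in} \to p_k\} \right].$$

For $D = 1$ we get $|\Pi| = 1$ and the cell has a NGEC. When $D \ge 2$ we can estimate the number of paths using Stirling's approximation:

$$\#\{p_{in} \to p_k\} = \frac{\left( \sum_{i = 1}^{D} (p_k - p_{in})_i \right)!}{\prod_{i = 1}^{D} (p_k - p_{in})_i!} = \frac{(Dk)!}{(k!)^D} \approx \frac{\sqrt{2 \pi D k} \left( \frac{Dk}{e} \right)^{Dk}}{\left( \sqrt{2 \pi k} \left( \frac{k}{e} \right)^{k} \right)^{D}} = \frac{\sqrt{D}}{\sqrt{2 \pi k}^{\,D - 1}} D^{Dk}.$$

When we combine this with the FG activations we can estimate the derivative for large $k$ with Stirling's approximation and get

$$\frac{\partial s_{p_k}}{\partial s_{p_{in}}} \in \left[ \varepsilon^{Dk}\, \#\{p_{in} \to p_k\},\; (1 - \varepsilon)^{Dk}\, \#\{p_{in} \to p_k\} \right] \qquad (25)$$

$$\approx \left[ \frac{\sqrt{D}}{\sqrt{2 \pi k}^{\,D - 1}} (D \varepsilon)^{Dk},\; \frac{\sqrt{D}}{\sqrt{2 \pi k}^{\,D - 1}} \left( D (1 - \varepsilon) \right)^{Dk} \right].$$

The upper bound of this interval can grow for large $k$ if $D (1 - \varepsilon) > 1$, and this is the case for $D \ge 2$. So the MD LSTM cell can have an exploding gradient for $D \ge 2$. When the weights to the FGs are initialized with small values, we have $y^{\phi,d}_p \approx 0.5$. Then we have an exploding gradient when $D \ge 3$ at the start of training. In the worst case we have $y^{\phi,d}_p \approx 1$ and the derivative in (25) goes for large $k$ to

$$\frac{\partial s_{p_k}}{\partial s_{p_{in}}} \approx \frac{\sqrt{D}}{\sqrt{2 \pi k}^{\,D - 1}} D^{Dk}.$$

A.4 Proof of 17

Proof Let $c$ be an MD LSTM Stable cell of dimension $D \ge 2$ (for $D = 1$ the proof is equivalent to the 1D case of the LSTM cell), $p, p_1, p_2, p_{in}, p_{out} \in \mathbb{N}^D$, $p_{in} \le p_{out}$ arbitrary dates and $h = \tanh$ the sigmoid function. Besides, $\varepsilon > 0$ is a later chosen value. In the first step we want to show that there are activations of the forget gates, so that

$$\frac{\partial s_p}{\partial s_{p_{in}}} \in \begin{cases} \left[ (1 - D \varepsilon)^{2 |p - p_{in}|}, 1 \right] & \text{for } p_{in} \le p \le p_{out} \\ [0, \varepsilon] & \text{otherwise} \end{cases} \qquad (26)$$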

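is fulfilled. The proof is done using induction over $k = |p - p_{in}|$. The base $k = 0$ is clear. Let $k \ge 1$. We define $P := \{ d \in \{1, \dots, D\} \mid p_d^- \ge p_{in} \}$, the set of dimensions $d$ in which there are $p_{in}$-$p_d^-$-paths. Note that this set cannot be empty, because $p > p_{in}$ for $k \ge 1$. When we have a dimension $d \in P$ then $|p_d^- - p_{in}| = k - 1$ and we assume

$$\frac{\partial s_{p_d^-}}{\partial s_{p_{in}}} \in \left[ (1 - D \varepsilon)^{2 |p_d^- - p_{in}|}, 1 \right] = \left[ (1 - D \varepsilon)^{2 (k - 1)}, 1 \right]. \qquad (27)$$

When we choose the activations of the LGs to be

$$y^{\lambda,d}_p \in \begin{cases} [1 - \varepsilon, 1) & \text{for } d \in P \text{ and } p_{in} < p \le p_{out} \\ (0, \varepsilon] & \text{otherwise,} \end{cases}$$

we can estimate

$$\frac{\sum_{d \in P} y^{\lambda,d}_p}{\sum_{d' = 1}^{D} y^{\lambda,d'}_p} \in [1 - D \varepsilon, 1], \qquad (28)$$

because

$$\frac{\sum_{d \in P} y^{\lambda,d}_p}{\sum_{d' = 1}^{D} y^{\lambda,d'}_p} = \frac{\sum_{d \in P} y^{\lambda,d}_p}{\sum_{d \in P} y^{\lambda,d}_p + \sum_{d \in \{1, \dots, D\} \setminus P} y^{\lambda,d}_p} \ge \frac{|P| (1 - \varepsilon)}{|P| (1 - \varepsilon) + (D - |P|) \varepsilon} \ge 1 - D \varepsilon.$$

Setting the FG to

$$y^{\phi}_p \in \begin{cases} [1 - \varepsilon, 1) & \text{for } p_{in} < p \le p_{out} \\ (0, \varepsilon] & \text{otherwise} \end{cases} \qquad (29)$$

we can estimate the derivative for $p_{in} \le p \le p_{out}$ using (27), (28) and (29) to

$$\frac{\partial s_p}{\partial s_{p_{in}}} = y^{\phi}_p \left( \sum_{d \in P} \frac{\partial s_{p_d^-}}{\partial s_{p_{in}}} \frac{y^{\lambda,d}_p}{\sum_{d' = 1}^{D} y^{\lambda,d'}_p} + \sum_{d \in \{1, \dots, D\} \setminus P} \underbrace{\frac{\partial s_{p_d^-}}{\partial s_{p_{in}}}}_{=0} \frac{y^{\lambda,d}_p}{\sum_{d' = 1}^{D} y^{\lambda,d'}_p} \right) \in \left[ (1 - \varepsilon)(1 - D \varepsilon)^{2k - 2} (1 - D \varepsilon),\; 1 \right] \subseteq \left[ (1 - D \varepsilon)^{2k}, 1 \right], \qquad (30)$$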

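so (26) is fulfilled for $p_{in} \le p \le p_{out}$. If we have $p < p_{in}$ in (26), the derivative is 0, because we have a causal system. For $p > p_{out}$ the FG is closed (see (29)), and using the upper bounds of (27) and (28) we get

$$\frac{\partial s_p}{\partial s_{p_{in}}} = \underbrace{y^{\phi}_p}_{\in (0, \varepsilon]} \sum_{d = 1}^{D} \frac{\partial s_{p_d^-}}{\partial s_{p_{in}}} \frac{y^{\lambda,d}_p}{\sum_{d' = 1}^{D} y^{\lambda,d'}_p} \in [0, \varepsilon] \qquad (31)$$

and (26) is fulfilled. In the second step let $p_2$ be the date for which we want to calculate the truncated gradient $\frac{\partial s_{p_2}}{\partial y^{in}_{p_1}}$. We choose the IG activation as

$$y^{\iota}_p \in \begin{cases} [1 - \varepsilon, 1) & \text{if } p = p_{in} \\ (0, \varepsilon] & \text{otherwise} \end{cases} \qquad (32)$$

and we get $\frac{\partial s_p}{\partial y^{in}_p} = y^{\iota}_p$. Using (30), (31) and (32), we can estimate the partial derivative by

$$\frac{\partial s_{p_2}}{\partial y^{in}_{p_1}} = \frac{\partial s_{p_2}}{\partial s_{p_1}} \frac{\partial s_{p_1}}{\partial y^{in}_{p_1}} \in \begin{cases} \left[ (1 - \varepsilon)(1 - D \varepsilon)^{2 |p_2 - p_{in}|}, 1 \right] & \text{for } p_1 = p_{in} \text{ and } p_{in} \le p_2 \le p_{out} \\ [0, \varepsilon] & \text{otherwise,} \end{cases}$$

and setting

$$\varepsilon := \min \left\{ \delta,\; \frac{1 - (1 - \delta)^{\frac{1}{2 |p_{in} - p_{out}| + 1}}}{D} \right\}$$

the conditions of Definition 11 are fulfilled.

A.5 Proof of 18

Proof Let $c$ be an MD LSTM Stable cell of dimension D with the internal state $s_p$ and $p_{in}, p \in \mathbb{N}^D$, $p_{in} \le p$ two arbitrary dates with $|p - p_{in}| = k$. Let all gate activations be arbitrary in $[0, 1]$. We show that

$$\frac{\partial s_p}{\partial s_{p_{in}}} \in [0, 1] \qquad (33)$$

is fulfilled $\forall k \in \mathbb{N}$ using induction over $k$. For the base case $k = 0$ we get

$$\frac{\partial s_p}{\partial s_{p_{in}}} = \frac{\partial s_{p_{in}}}{\partial s_{p_{in}}} = 1.$$

Let (33) be fulfilled for $k - 1$. That means if $p_d^- \ge p_{in}$ we have $|p_d^- - p_{in}| = k - 1$ and this leads to $\frac{\partial s_{p_d^-}}{\partial s_{p_{in}}} \in [0, 1]$. If $p_d^- \not\ge p_{in}$ then there is no $p_{in}$-$p_d^-$-path and we have $\frac{\partial s_{p_d^-}}{\partial s_{p_{in}}} = 0$ for this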

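dimension. Then we can calculate the derivative

$$0 \le \frac{\partial s_p}{\partial s_{p_{in}}} = y^{\phi}_p \sum_{d = 1}^{D} \frac{\partial s_{p_d^-}}{\partial s_{p_{in}}} \frac{y^{\lambda,d}_p}{\sum_{d' = 1}^{D} y^{\lambda,d'}_p} \in \left[ 0,\; \max_{d = 1, \dots, D} \left\{ \frac{\partial s_{p_d^-}}{\partial s_{p_{in}}} \right\} \right] \subseteq [0, 1],$$

which gives us the desired interval.

A.6 Proof of 20

Proof NEG: The cell has an NEG, because all gates have the same bounds as in the MD Stable cell.

NVG: To prove the NVG, we use the proof of Theorem 17. The difference between the MD Stable cell and the MD Leaky cell is that the activations of the FG and IG depend on each other for the Leaky cell. Let $p_{in}, p \in \mathbb{N}^D$, $p_{in} \le p$ be two arbitrary dates as in Theorem 17. The IG has just the restriction that for $p = p_{in}$ it has to hold that $y^{\iota}_p \in [1 - \varepsilon, 1)$. Here, the FG can have an arbitrary activation, so we choose $y^{\phi}_p = 1 - y^{\iota}_p$. For all $p > p_{in}$ the FG has to be in the ranges shown in (29), while the IG has no restriction and we choose $y^{\iota}_p = 1 - y^{\phi}_p$, so the MD Leaky cell has the NVG.

COD: The proof that the MD Leaky cell allows COD can be done by estimating the bounds of $s_p$. From the update equations of the cell we get

$$|s_{p^-}| \le \max_{i = 1, \dots, D} |s_{p_i^-}|.$$

Now we can estimate the internal state using the range $y^{in}_p \in [-1, 1]$ and recursion over

$$|s_p| = \left| \left( 1 - y^{\phi}_p \right) y^{in}_p + y^{\phi}_p\, s_{p^-} \right| \le \max \left\{ |y^{in}_p|, |s_{p_1^-}|, \dots, |s_{p_D^-}| \right\} \le \max_{q \le p} \left\{ |y^{in}_q| \right\}$$

and get $s_p \in [-1, 1]$. To fulfill the derivatives in Definition 12, for $\delta_1$ we choose $y^{\omega}_p \in [1 - \varepsilon, 1)$ and get

$$\delta_1 \le \min_{s_p} \left\{ h'(s_p) \right\} (1 - \varepsilon) = h'(1)(1 - \varepsilon). \qquad (34)$$

For $\delta_2$ we choose $y^{\omega}_p \in (0, \varepsilon]$ and get

$$\delta_2 \ge \max_{s_p} \left\{ h'(s_p) \right\} \varepsilon = \varepsilon. \qquad (35)$$

To fulfill the derivatives in Definition 12 we use (34), (35) and $h'(1) > \frac{1}{3}$, and with

$$\varepsilon \le \delta_2 < \delta_1 \le h'(1)(1 - \varepsilon) \quad \Leftarrow \quad \varepsilon \le \frac{1}{4} < \frac{h'(1)}{h'(1) + 1}$$

the COD is proven.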