Efficient Orthogonal Parametrisation of Recurrent Neural Networks Using Householder Reflections


Zakaria Mhammedi, The Department of Computing and Information Systems, The University of Melbourne, Parkville VIC 3010, Australia (ZMHAMMEDI@STUDENT.UNIMELB.EDU.AU)
Andrew Hellicar, Data61, CSIRO, Sandy Bay, TAS, Australia (ANDREW.HELLICAR@DATA61.CSIRO.AU)
Ashfaqur Rahman, Data61, CSIRO, Sandy Bay, TAS, Australia (ASHFAQUR.RAHMAN@DATA61.CSIRO.AU)
James Bailey, The Department of Computing and Information Systems, The University of Melbourne, Parkville VIC 3010, Australia (JAMES.BAILEY@UNIMELB.EDU.AU)

arXiv preprint, v1 [cs.LG], 1 Dec 2016

Abstract

Recurrent Neural Networks (RNNs) have been successfully used in many applications. However, the problem of learning long-term dependencies in sequences using these networks is still a major challenge. Recent methods have been suggested to solve this problem by constraining the transition matrix to be unitary during training, which ensures that its norm is exactly equal to one. These methods either have limited expressiveness or scale poorly with the size of the network when compared to the simple RNN case, especially in an online learning setting. Our contributions are as follows. We first show that constraining the transition matrix to be unitary is a special case of an orthogonal constraint. Therefore, it may not be necessary to work with complex-valued matrices. Then we present a new parametrisation of the transition matrix which allows efficient training of an RNN while ensuring that the transition matrix is always orthogonal. Using our approach, one online gradient step can, in the worst case, be performed in time complexity O(Tn^2), where T and n are the length of the input sequence and the size of the hidden layer respectively. This time complexity is the same as for the simple RNN. Finally, we test our new parametrisation on problems with long-term dependencies. Our results suggest that the orthogonal constraint on the transition matrix has similar benefits to the unitary constraint.

1. Introduction

Recurrent Neural Networks (RNNs) have been successfully used in many applications involving time series. This is because RNNs are well suited for sequential data: they process inputs one element at a time and store relevant information in their hidden state. In practice, however, training simple RNNs can be challenging due to the problem of exploding and vanishing gradients (Hochreiter et al., 2001). It has been shown that the exploding gradient problem is likely to occur when the transition matrix of an RNN has a spectral norm larger than one (Glorot & Bengio, 2010). This results in an error surface, associated with some objective function, having very steep walls (Pascanu et al., 2013). Landing on one of these walls during training results in a very large gradient step, which can disrupt training. On the other hand, when the spectral norm of the transition matrix is less than one, the information at one time step tends to vanish after a few time steps. This makes it challenging to learn long-term dependencies in sequential data.
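To make the role of the spectral norm concrete, the following small numpy illustration (ours, not from the paper) tracks how the purely linear part of the recurrence rescales a unit perturbation of the hidden state over many time steps, for transition matrices with spectral norm just below, exactly at, and just above one; the non-linearity and the inputs are ignored here.

```python
import numpy as np

rng = np.random.default_rng(0)
n, T = 64, 100
Q, _ = np.linalg.qr(rng.normal(size=(n, n)))     # orthogonal matrix: spectral norm exactly 1

for scale in (0.95, 1.0, 1.05):
    W = scale * Q                                 # spectral norm = scale
    delta = rng.normal(size=n)
    delta /= np.linalg.norm(delta)                # unit perturbation of the hidden state
    for _ in range(T):
        delta = W @ delta                         # linearised propagation through T steps
    # roughly 0.95^100 (vanishes), 1 (preserved), 1.05^100 (explodes)
    print(scale, np.linalg.norm(delta))
```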

Different methods have been suggested to solve either the vanishing or the exploding gradient problem. The LSTM was specifically designed to help with the vanishing gradient (Hochreiter & Schmidhuber, 1997). This is achieved by using gate vectors which allow a linear flow of information through the hidden state. However, the LSTM does not directly address the exploding gradient problem. In fact, the longer the input sequence is, the more prone the LSTM is to exploding gradients.

One approach to solving this issue is to clip the gradients (Pascanu et al., 2013) when their norm exceeds some threshold value. This has the negative side effect of adding an additional hyperparameter to the model. Furthermore, if the exploding gradient occurs within some region of the parameter search space, the associated error surface still has steep walls. This makes training challenging even with gradient clipping.

Another way to approach this problem is to improve the shape of the error surface directly by making it smoother, which can be achieved by constraining the spectral norm of the transition matrix to be less than or equal to one. However, a value of exactly one is best for the vanishing gradient problem. A good choice of the activation function between hidden states is also crucial in this case. These ideas have been investigated in recent work. In particular, the unitary RNN (Arjovsky et al., 2015) constrains the transition matrix to be in the unitary group U(n) using a special parametrisation. This ensures that the norm of the transition matrix is exactly equal to one. This parametrisation and other similar ones (Hyland & Rätsch, 2016; Wisdom et al., 2016) have advantages and drawbacks which we discuss in more detail in the next section.

The main contributions of this work are as follows:

- We first show that constraining the search space of the transition matrix of an RNN to the unitary group U(n) is equivalent to limiting the search space to a subset of the orthogonal group O(2n) of a new RNN with twice the hidden size. This suggests that it may not be necessary to work with complex-valued matrices.
- We present a simple and efficient way to parametrise orthogonal transition matrices. We show that the time and space complexities of using this parametrisation are, in the worst case, the same as those of the simple RNN. This new parametrisation can be used in other types of neural networks, not just recurrent ones.

2. Related Work

Throughout this work we refer to elements of the following simple RNN architecture:

h_t = φ(W h_{t-1} + V x_t),   (1)
o_t = Y h_t,   (2)

where W, V, and Y are the hidden-to-hidden, input-to-hidden, and hidden-to-output weight matrices, h_{t-1} and h_t are the hidden vectors at time steps t-1 and t respectively, and φ is a non-linear activation function. We have omitted the bias terms for simplicity.

Regarding the structure of the transition matrix W, recent research has explored how its initialisation influences training and the ability to learn long-term dependencies. In particular, initialising with specific structures such as the identity or an orthogonal matrix can greatly improve performance (Le et al., 2015). In addition to these initialisation methods, one study has also considered removing the non-linearity between the hidden-to-hidden connections (Henaff et al., 2016), i.e. the term W h_{t-1} in Equation (1) is placed outside the activation function φ. This method showed good results compared to the LSTM on pathological synthetic problems exhibiting long-term dependencies.

After training a model for a few iterations using gradient descent, however, nothing guarantees that the initial structure of the transition matrix will be kept. In fact, its spectral norm can deviate from one, and exploding and vanishing gradients can become a problem again.
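For reference, Equations (1) and (2) correspond to the following forward pass. This is a minimal numpy sketch for illustration, with tanh standing in for φ; it is not the implementation used in the paper.

```python
import numpy as np

def simple_rnn_forward(W, V, Y, xs, h0):
    """Simple RNN of Equations (1)-(2): h_t = phi(W h_{t-1} + V x_t), o_t = Y h_t."""
    h, outputs = h0, []
    for x in xs:                      # xs is a sequence of input vectors
        h = np.tanh(W @ h + V @ x)    # hidden-state update (phi = tanh here)
        outputs.append(Y @ h)         # hidden-to-output projection
    return outputs, h
```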
It is possible to constrain the transition matrix to be orthogonal during training using more sophisticated methods (Wisdom et al., 2016). However, this comes at the cost of time complexity. A naive approach is to use gradient descent and project the updated transition matrix onto the orthogonal group, using the QR decomposition for example. This incurs an O(n^3) cost at each iteration, where n is the size of the hidden layer, which can be an issue when the batch size and the sequence length T are small compared to n.

Recently, new parametrisations of the transition matrix have been suggested (Arjovsky et al., 2015) which ensure that its spectral norm is always equal to one. This condition is sufficient to ensure that exploding gradients do not occur, and methods such as gradient clipping become unnecessary. The unitary RNN (urnn) is one example, where the hidden matrix W ∈ C^{n×n} is the product of elementary matrices consisting of reflection, diagonal, and Fourier transform matrices. When the size of the hidden layer is equal to n, the transition matrix has a total of only 7n parameters. Another advantage of this parametrisation is computational efficiency: the matrix-vector product W v, for some vector v, can be calculated in O(n log n) time. However, it has been shown that this parametrisation does not allow the transition matrix to span the full unitary group (Wisdom et al., 2016), which may limit the expressiveness of the model.

Another interesting parametrisation (Hyland & Rätsch, 2016) takes advantage of the algebraic properties of the unitary group U(n). The idea is to use the corresponding matrix Lie algebra u(n) of skew-Hermitian matrices.

In particular, the transition matrix can be written as

W = exp( Σ_{i=1}^{n^2} λ_i T_i ),   (3)

where exp is the matrix exponential and {T_i}_{i=1}^{n^2} are predefined n×n matrices forming a basis of the Lie algebra u(n). The learnable parameters are the weights {λ_i}. The fact that the matrix Lie algebra u(n) is closed and connected ensures that the exponential mapping from u(n) to U(n) is surjective. Therefore, with this parametrisation the search space of the transition matrix spans the whole unitary group. This is one advantage over the original unitary parametrisation (Arjovsky et al., 2015). However, the cost of computing the matrix exponential in Equation (3) is O(n^3), where n is the dimension of the hidden state, and evaluating Σ_{i=1}^{n^2} λ_i T_i can also be computationally expensive.

A more recent method (Wisdom et al., 2016) performs optimisation directly on the Stiefel manifold; the corresponding model was called the full-capacity unitary RNN. Using this approach, the transition matrix can span the full unitary group. However, the method involves a matrix inverse as well as matrix-matrix products, which have time complexity O(n^3). This can be problematic when the size of the hidden layer is large. This method is somewhat similar to the projection approach discussed earlier.

All the methods discussed above, except for the original unitary RNN, involve a step that requires at least O(n^3) time. Furthermore, they all require the use of complex matrices. Table 1 summarises the time complexities of various methods, including our approach, for one online gradient step. In the next section, we show that imposing a unitary constraint on a transition matrix W ∈ C^{n×n} is equivalent to imposing a constraint in a subset of the orthogonal group O(2n) on a new RNN with twice the hidden size. Furthermore, since the norm of an orthogonal matrix is also always equal to one, using the latter has the same theoretical advantages as using unitary matrices when it comes to the exploding gradient problem.

3. Complex unitary versus orthogonal

It is possible to show that when the transition matrix W ∈ C^{n×n} of an RNN is unitary, there exists an equivalent representation of this RNN with an orthogonal transition matrix Ŵ ∈ R^{2n×2n}. Consider a complex unitary transition matrix W = A + iB ∈ C^{n×n}, where A and B are real-valued matrices in R^{n×n}. Furthermore, consider the following new variables:

ĥ_t = [Re(h_t); Im(h_t)],   V̂ = [Re(V); Im(V)],   Ŵ = [[A, -B], [B, A]],

where Re and Im denote the real and imaginary parts of a complex number. Note that ĥ_t ∈ R^{2n}, Ŵ ∈ R^{2n×2n}, and V̂ ∈ R^{2n×n_x}, where n_x is the dimension of the input vector x_t in Equation (1). Assuming that the activation function φ applies to the real and imaginary parts separately, it is easy to show that the update equation of the complex hidden state h_t of the unitary RNN can be written in the real space as

ĥ_t = φ(Ŵ ĥ_{t-1} + V̂ x_t).   (4)

Even when the activation function φ does not apply to the real and imaginary parts separately, it is still possible to find an equivalent real-space representation. Take for example the activation function proposed by Arjovsky et al. (2015),

σ_modReLU(z) = (z/|z|)(|z| + b)  if |z| + b > 0,  and  0  otherwise,   (5)

where b is a bias. For a hidden state ĥ ∈ R^{2n}, the equivalent activation function in the real-space representation (4) is given by

φ(ĥ)_i = (ĥ_i / sqrt(ĥ_i^2 + ĥ_{k_i}^2)) (sqrt(ĥ_i^2 + ĥ_{k_i}^2) + b_{k̄_i})  if sqrt(ĥ_i^2 + ĥ_{k_i}^2) + b_{k̄_i} > 0,  and  0  otherwise,   (6)

where k_i = (i + n) mod 2n and k̄_i = i mod n for all i ∈ {1, ..., 2n}. This activation is no longer applied to the hidden units independently.
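The equivalence between Equations (5) and (6) can be checked numerically. Below is a small numpy sketch we add for illustration (0-based indexing and a small guard against division by zero); it is not code from the paper.

```python
import numpy as np

def modrelu(z, b):
    """modReLU of Equation (5) applied to a complex vector z with bias vector b."""
    r = np.abs(z)
    scale = (r + b) / np.maximum(r, 1e-12)          # (|z| + b)/|z|, guarded
    return np.where(r + b > 0, scale * z, 0.0)

def modrelu_real(h_hat, b):
    """Equivalent real-space activation of Equation (6); h_hat = [Re(h); Im(h)]."""
    n = h_hat.size // 2
    partner = np.roll(h_hat, -n)                    # h_hat[(i + n) mod 2n]
    r = np.sqrt(h_hat**2 + partner**2)              # modulus shared by the two halves
    bias = np.concatenate([b, b])                   # b[i mod n]
    scale = (r + bias) / np.maximum(r, 1e-12)
    return np.where(r + bias > 0, scale * h_hat, 0.0)

rng = np.random.default_rng(0)
n = 5
z = rng.normal(size=n) + 1j * rng.normal(size=n)
b = rng.normal(size=n)
out = modrelu(z, b)
print(np.allclose(modrelu_real(np.concatenate([z.real, z.imag]), b),
                  np.concatenate([out.real, out.imag])))    # True
```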
Now we show that the matrix Ŵ is orthogonal. By definition of a unitary matrix, we have W W* = I, where * denotes the conjugate transpose. This implies

A A^T + B B^T = I,   (7)
B A^T - A B^T = 0.   (8)

On the other hand, we have

Ŵ Ŵ^T = [[A A^T + B B^T, A B^T - B A^T], [B A^T - A B^T, B B^T + A A^T]].   (9)

Using Equations (7) and (8), it follows that Ŵ Ŵ^T = I. Also note that the matrix Ŵ has a special structure: it is a block matrix whose blocks are A, B, -B, and A.

The discussion above shows that using a complex unitary matrix in C^{n×n} is equivalent to using an orthogonal matrix, belonging to a subset of O(2n), in a new RNN with twice the hidden size. This is why in this work we focus on parametrising orthogonal matrices directly.
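This block structure is easy to verify numerically. The following numpy sketch (ours, for illustration) builds a random unitary matrix via a complex QR decomposition and checks that its real 2n×2n embedding is orthogonal.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 6
# random unitary W = A + iB, obtained from the QR decomposition of a complex Gaussian matrix
Z = rng.normal(size=(n, n)) + 1j * rng.normal(size=(n, n))
W, _ = np.linalg.qr(Z)
A, B = W.real, W.imag

W_hat = np.block([[A, -B], [B, A]])                   # real 2n x 2n embedding
print(np.allclose(W_hat @ W_hat.T, np.eye(2 * n)))    # True: W_hat is orthogonal
```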

Method | Constraint on the transition matrix | Time complexity of one online gradient step | Search space of the transition matrix
urnn (Arjovsky et al., 2015) | ||W|| = 1 | O(T n log n) | A subset of U(n)
Full-capacity urnn (Wisdom et al., 2016) | ||W|| = 1 | O(T n^2 + n^3) | The full U(n) group
Full-capacity urnn (Hyland & Rätsch, 2016) | ||W|| = 1 | O(T n^2 + n^3) | The full U(n) group
ornn (our approach), with m ≤ n | ||W|| = 1 | O(T n m) | The full O(n) group when m = n

Table 1. Time complexities associated with one gradient step for different methods, when the size of the hidden layer of an RNN is n and the input is one sequence of length T.

In the next section, we present an efficient and flexible parametrisation of the transition matrix which allows it to span the full orthogonal group.

4. Parametrisation of the transition matrix

Before discussing the details of our parametrisation, we need to introduce some notation. For n ∈ N and 2 ≤ k ≤ n, we define the mappings

H_k : R^k → R^{n×n},   u ↦ [[I_{n-k}, 0], [0, I_k - 2 u u^T]],   (10)

where I_k denotes the k-dimensional identity matrix. Note that when ||u|| = 1, H_k(u) is simply a Householder reflection in R^{n×n}, which is orthogonal. For the special case k = 1, we define

H_1 : R → R^{n×n},   u ↦ [[I_{n-1}, 0], [0, u]],   (11)

where u is simply a scalar in this case. Note that H_1(u) is orthogonal when u ∈ {-1, 1}.

We propose to parametrise the transition matrix of an RNN using the vectors {u_i}, which we refer to as the reflection vectors in what follows. This parametrisation has many advantages, as the following theorem helps to show.

Theorem 1. Let n ∈ N and 1 ≤ k ≤ n. Define the mapping

M_k : R^k × ... × R^n → R^{n×n},   (u_k, ..., u_n) ↦ H_n(u_n) ... H_k(u_k),

where the functions {H_k}_{k=1}^n are defined in Equations (10) and (11). Then:

- The mappings M_k are smooth for all 1 ≤ k ≤ n.
- The image of M_1 includes the orthogonal group O(n), i.e. O(n) ⊆ M_1[R × R^2 × ... × R^n].

Sketch of the proof. The first point of Theorem 1 is straightforward: the mapping M_k is the product of n - k + 1 smooth maps, and is therefore also smooth. For the second point, we need to show that for every Q ∈ O(n) there exists a tuple of vectors (u_1, ..., u_n) ∈ R × R^2 × ... × R^n such that Q = M_1(u_1, ..., u_n). Algorithm 1 shows how a QR decomposition can be performed using the matrices {H_k(u_k)}_{k=1}^n while ensuring that the upper triangular matrix R has positive diagonal elements. If we apply this algorithm to an orthogonal matrix Q, we obtain a tuple (u_1, ..., u_n) which satisfies H_n(u_n) ... H_1(u_1) R = Q. The matrix R must be orthogonal, since R = (H_n(u_n) ... H_1(u_1))^T Q. Therefore R = I, since the only upper triangular orthogonal matrix with positive diagonal elements is the identity. Hence, M_1(u_1, ..., u_n) = H_n(u_n) ... H_1(u_1) = Q.

When using m reflection vectors, the parametrisation can be expressed as

W = M_{n-m+1}(u_{n-m+1}, ..., u_n) = H_n(u_n) ... H_{n-m+1}(u_{n-m+1}).   (12)

This parametrisation has the following advantages:

- The parametrisation is smooth, which is convenient for training with gradient descent.

- When m = n, the transition matrix can span the whole orthogonal group O(n).
- The only requirement for the matrix W to be orthogonal is that the vectors {u_i} have norm one. One way to ensure that this is always the case is to project the vectors onto the unit sphere after each gradient step, which takes O(nm) time.
- The time and space complexities involved in one gradient step are, in the worst case, the same as those of a simple RNN with the same number of hidden units. This is mainly because the transition matrix W never needs to be generated explicitly: only the matrix-vector products W v are needed, and these are cheaper to compute using our parametrisation.
- The parametrisation is flexible. A good trade-off between expressiveness and speed can be found by selecting the number of reflection vectors. When the goal is to ensure that W can span the whole orthogonal group, the number of reflection vectors must equal the number of hidden units. Even in this case, the number of redundant parameters associated with W is small: when m = n, the total number of parameters needed for W is n(n+1)/2, of which only n are redundant, since the orthogonal group O(n) is an n(n-1)/2-dimensional manifold.

Note that H_1(u_1) is not the standard Householder reflection. The second point of Theorem 1 (i.e. the surjectivity onto O(n)) would not hold had we used a standard Householder reflection instead.

Algorithm 1  QR decomposition using the mappings {H_k}_{k=1}^n. For a matrix R ∈ R^{n×n}, R_{k,k} denotes its k-th diagonal element, and R_{k..n,k} ∈ R^{n-k+1} denotes the part of the k-th column from the k-th row down (including the element in the k-th row).

Require: A ∈ R^{n×n}, a full-rank matrix.
Ensure: Q and R such that A = QR, where Q = H_n(u_n) ... H_1(u_1) and R is upper triangular with positive diagonal elements.
  R ← A
  Q ← I   {initialise Q to the identity matrix}
  for k = 1 to n-1 do
    if R_{k,k} == ||R_{k..n,k}|| then
      u_{n-k+1} ← [0, ..., 0, 1]^T ∈ R^{n-k+1}
    else
      u_{n-k+1} ← R_{k..n,k} - ||R_{k..n,k}|| [1, 0, ..., 0]^T
      u_{n-k+1} ← u_{n-k+1} / ||u_{n-k+1}||
    end if
    R ← H_{n-k+1}(u_{n-k+1}) R
    Q ← Q H_{n-k+1}(u_{n-k+1})
  end for
  u_1 ← sgn(R_{n,n})
  R ← H_1(u_1) R
  Q ← Q H_1(u_1)

4.1. Time complexity

When using the parametrisation in (12), we do not need to compute the matrix W explicitly; we only need the matrix-vector products W v. The cost of calculating W v is O(nm), where m and n are the number of reflection vectors and the size of the hidden state respectively. Consequently, given a sequence of length T, the time complexity of Forward Propagation (FP) is O(Tmn). Back Propagation Through Time (BPTT) can be achieved within the same time complexity as FP. Note that for a simple RNN the complexity of both FP and BPTT is O(Tn^2) for one sequence of length T. Therefore, training an RNN using our parametrisation incurs, in the worst case, the same time complexity as the simple RNN; the worst case is when the number of reflection vectors equals the size of the hidden state.

4.2. Space complexity

Parametrising the transition matrix of an RNN using (12) with m reflections is equivalent to adding m - 1 intermediate hidden states between any two consecutive time steps. For time step t, these vectors are

h_{t,k} = H_{n-m+k}(u_{n-m+k}) ... H_{n-m+1}(u_{n-m+1}) h_t,   (13)

for 1 ≤ k ≤ m - 1, with the convention h_{t,m} = h_{t+1,0} = h_{t+1}.
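Since W is never formed explicitly, the product W h in Equation (12) is computed by applying the reflections one after another, and the intermediate results are exactly the states of Equation (13). The following numpy sketch is our illustration of this (it assumes the reflection vectors are zero-padded to length n, stored as the columns of U in the order (u_n, ..., u_{n-m+1}), and have unit norm); it is not the paper's implementation.

```python
import numpy as np

def apply_W(U, h):
    """Compute W h for W = H_n(u_n) ... H_{n-m+1}(u_{n-m+1}) (Equation (12)).

    U is (n, m) with columns (u_n, ..., u_{n-m+1}), each zero-padded at the top
    and assumed to have unit norm. Also returns the intermediate states of
    Equation (13). Total cost is O(n m).
    """
    v = h
    states = [v]                        # states[k] = h_{t,k}, with states[0] = h_t
    for u in U.T[::-1]:                 # apply H_{n-m+1} first, H_n last
        v = v - 2.0 * u * (u @ v)       # (I - 2 u u^T) v, O(n) per reflection
        states.append(v)
    return v, states
```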
Using an algorithm such as BPTT to calculate the gradients requires the values of the hidden states, including the intermediate ones, in order to propagate the gradients. A naive approach would be to store the values of h_{t,k} for all t ≤ T and 1 ≤ k ≤ m - 1. This increases the memory requirement by a factor of m compared to the simple RNN, which may be acceptable if m is fixed irrespective of the size of the hidden state n; when m = n, however, it is clearly a problem. To circumvent this issue, note that the h_{t,k} for 1 ≤ k ≤ m - 1 do not need to be stored explicitly during forward propagation: they can be recalculated on the fly during BPTT, which only adds O(mn) operations at each time step. Indeed, from Equation (13), the only ingredients needed to regenerate the intermediate hidden states at time step t are the reflection vectors {u_i} and the hidden state h_t. Having to generate the intermediate vectors locally does not affect the overall time complexity of the BPTT algorithm: adding O(mn) operations at each time step still results in an overall complexity

of O(Tmn) for one sequence of length T. However, unless one implements the BPTT algorithm from scratch, it would be challenging to modify a pre-existing library to manage memory in the way needed for our parametrisation.

4.3. Compact WY representation

In order to make the implementation of our method feasible, we need to derive explicit expressions for the following gradients with respect to some scalar loss function L:

∇U = [∂(Wh)/∂U]^T ∇(Wh),   (14)
∇h = [∂(Wh)/∂h]^T ∇(Wh),   (15)

where W is the transition matrix, U = (u_n ... u_{n-m+1}) is the matrix whose columns are the reflection vectors, and ∇X denotes ∂L/∂X. Furthermore, we need to be able to compute these gradients in O(mn) time in order to achieve the global time and space complexities discussed in the previous subsections. Having explicit expressions for ∇U and ∇h allows an implementation that does not require storing any intermediate hidden states for BPTT. Note that in the case of an RNN, ∇U represents the immediate gradient at a given time step t; in that case, the hidden states {h_k}_{k ≥ t} are affected by a small perturbation of U only through the hidden state h_t.

To write the mathematical expressions of the gradients in Equations (14) and (15), we use the compact WY representation (Joffrain et al., 2006) of a product of Householder reflections, described by the following proposition.

Proposition 1. (Joffrain et al., 2006) Let n ∈ N and m ≤ n - 1. For n - m + 1 ≤ i ≤ n, let u_i ∈ R^i be the reflection vectors and let U = (u_n ... u_{n-m+1}) be the matrix whose columns are the {u_i} (padded with leading zeros to length n). Then

H_n(u_n) ... H_{n-m+1}(u_{n-m+1}) = I_n - U T^{-1} U^T,   (16)

where

T = striu(U^T U) + (1/2) diag(U^T U),   (17)

with striu(U^T U) and diag(U^T U) representing the strictly upper triangular part and the diagonal of the matrix U^T U, respectively. This is the compact WY representation.

For the particular case of m = n in Proposition 1, the compact WY representation can be written as

H_n(u_n) ... H_1(u_1) = (I_n - U T^{-1} U^T) H_1(u_1),   (18)

where U = (u_n ... u_2) and H_1(u_1) is defined in Equation (11).

Theorem 2. Let m ≤ n - 1 and W = I_n - U T^{-1} U^T, where U = (u_n ... u_{n-m+1}) and T is defined in Equation (17). Let J ∈ R^{m×m} be the matrix of all ones, let B = striu(J) + (1/2) I_m, and let L be a scalar loss function. We have

∇U = U [ (h̄ C̄^T) ∘ B^T + (C̄ h̄^T) ∘ B ] - ∇C h̄^T - h C̄^T,   (19)
∇h = ∇C - U C̄,   (20)

where C = W h, ∇C = ∂L/∂C, C̄ = T^{-T} U^T ∇C, h̄ = T^{-1} U^T h, and ∘ denotes the Hadamard product. The gradients ∇U and ∇h defined in Equations (19) and (20) can be calculated in O(mn) time and space.

The proof of Equations (19) and (20) is provided in the appendix. Algorithms 2 and 3 show routines which calculate T^{-1} v and T^{-T} v in O(mn) time for any vector v. Algorithm 4 shows a routine which calculates U [ (v w^T) ∘ B^T + (w v^T) ∘ B ] for any vectors v and w, also in O(mn) time. Using these routines, the gradients ∇U and ∇h can be calculated in O(mn) time.

Algorithm 2  Calculating T^{-1} v for any vector v. For a matrix U ∈ R^{n×m}, U_{:,k} denotes its k-th column.
Require: U = (u_n ... u_{n-m+1}) ∈ R^{n×m} and v ∈ R^m.
Ensure: o = T^{-1} v, where o ∈ R^m
  o ← [0, ..., 0]^T ∈ R^m   {initialise o to zeros}
  t ← [0, ..., 0]^T ∈ R^n   {temporary vector}
  for k = m to 1 do
    o_k ← 2 (v_k - U_{:,k}^T t) / ||U_{:,k}||^2
    t ← t + o_k U_{:,k}
  end for

Algorithm 3  Calculating T^{-T} v for any vector v. For a matrix U ∈ R^{n×m}, U_{:,k} denotes its k-th column.
Require: U = (u_n ... u_{n-m+1}) ∈ R^{n×m} and v ∈ R^m.
Ensure: o = T^{-T} v, where o ∈ R^m
  o ← [0, ..., 0]^T ∈ R^m   {initialise o to zeros}
  t ← [0, ..., 0]^T ∈ R^n   {temporary vector}
  for k = 1 to m do
    o_k ← 2 (v_k - U_{:,k}^T t) / ||U_{:,k}||^2
    t ← t + o_k U_{:,k}
  end for
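As a sanity check (ours, not from the paper), the identity of Proposition 1 can be verified numerically with numpy; the dense triangular solve below plays the role that Algorithms 2 and 3 play in O(mn) time in the actual implementation.

```python
import numpy as np

def compact_wy(U):
    """Right-hand side of Equation (16): I - U T^{-1} U^T, with T as in Equation (17)."""
    n = U.shape[0]
    G = U.T @ U
    T = np.triu(G, k=1) + 0.5 * np.diag(np.diag(G))
    return np.eye(n) - U @ np.linalg.solve(T, U.T)

def product_of_reflections(U):
    """Left-hand side of Equation (16), built explicitly; columns ordered (u_n, ..., u_{n-m+1})."""
    n = U.shape[0]
    W = np.eye(n)
    for u in U.T:
        W = W @ (np.eye(n) - 2.0 * np.outer(u, u))
    return W

rng = np.random.default_rng(0)
n, m = 8, 5
U = np.zeros((n, m))
for j, i in enumerate(range(n, n - m, -1)):      # column j holds u_i in R^i, zero-padded
    U[n - i:, j] = rng.normal(size=i)
    U[:, j] /= np.linalg.norm(U[:, j])           # unit-norm reflection vectors
print(np.allclose(product_of_reflections(U), compact_wy(U)))   # True
```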

Algorithm 4  Calculating U [ (v w^T) ∘ B^T + (w v^T) ∘ B ] for any vectors v and w in R^m. For a matrix U ∈ R^{n×m}, U_{:,k} denotes its k-th column.
Require: U = (u_n ... u_{n-m+1}) ∈ R^{n×m} and v, w ∈ R^m.
Ensure: Z = U [ (v w^T) ∘ B^T + (w v^T) ∘ B ] ∈ R^{n×m}
  Z ← 0_{n×m}   {initialise Z with zeros}
  t ← [0, ..., 0]^T ∈ R^n   {temporary vector}
  l ← [0, ..., 0]^T ∈ R^n   {temporary vector}
  for k = 1 to m - 1 do
    t ← t + U_{:,k} v_k
    Z_{:,k+1} ← Z_{:,k+1} + w_{k+1} t
    l ← l + U_{:,m-k+1} w_{m-k+1}
    Z_{:,m-k+1} ← Z_{:,m-k+1} + v_{m-k+1} l
  end for
  l ← l + U_{:,1} w_1
  Z_{:,1} ← Z_{:,1} + v_1 l

5. Experiments

We implemented and tested our method using the Python library Theano (Theano Development Team, 2016). We implemented a tensor operator F that takes a matrix W and a vector h and returns the product W h. In order to enable automatic differentiation, we also implemented a gradient method associated with the operator F, which calculates the gradients in Equations (19) and (20) using Algorithms 2, 3, and 4.

5.1. Scalability

In this first experiment we tested the scalability of our method by measuring the time required for one gradient calculation. The input consisted of one randomly generated two-dimensional sequence, and the target output was a fixed scalar value at the end of the sequence. We compared our method to a naive approach where the orthogonal matrix is generated explicitly before the forward propagation using the compact representation W = I - U T^{-1} U^T. Note that even though the matrix T is triangular and its inverse can be calculated in O(m^2) time, generating W would still require O(nm^2) operations because of the matrix-matrix products involved. Our method, on the other hand, does not require the explicit computation of W. We considered the following situations:

- We fixed the length T of the input sequence and varied the number of hidden units n of the ornn in the set {128, 256, 512, 1024, 2048} (see Figure 1(a)). The number of reflection vectors was equal to the hidden size (i.e. m = n). Our method scales as O(n^2), compared with O(n^3) for the naive approach, as expected. Our method is faster than the latter for n larger than a few hundred.
- We fixed the length of the input sequence and the number of hidden units (n = 2048), and varied the number of reflections m ∈ {128, 256, 512, 1024, 2048} (see Figure 1(b)). Our method scales as O(m).
- The last case consisted of fixing the number of hidden units and reflections to n = m = 128 and varying the length of the input sequence in {64, 128, 256, 512, 1024, 2048, 4096}. The computational time increased linearly with the length of the sequence (see Figure 1(c)).

This first experiment confirms that, using our approach, BPTT can be performed in O(Tmn) time for an input sequence of length T.

5.2. Application to problems with long-term dependencies

We tested the new parametrisation on three different datasets, all having long-term dependencies. We refer to the RNN parametrised using Householder reflections as the ornn. We set its activation function to the leaky Rectified Linear Unit (ReLU), defined as

φ(x) = x  if x > 0,  and  0.1x  otherwise.   (21)

For all experiments, we used the Adam method for stochastic gradient descent. We tested different learning rates in the set {10^-1, 10^-2, 10^-3, 10^-4} for every model and selected the ones which resulted in the fastest convergence on the training sets. We initialised all the parameters using uniform distributions similar to (Arjovsky et al., 2015).
The biases of all models were set to zero, except for the forget bias of the LSTM, which we set to 5 in order to facilitate the learning of long-term dependencies (Koutník et al., 2014). To ensure that the transition matrix of the ornn is always orthogonal, we projected each reflection vector u_i onto the unit sphere after each gradient iteration.

SEQUENCE GENERATION

In this experiment, we followed a setting similar to (Koutník et al., 2014), where we trained RNNs to encode song excerpts. We used the track Manyrista from the album Musica Deposita by Cuprum. We extracted five consecutive excerpts around the beginning of the song, each having 800 data points and corresponding to about 18 ms at a 44.1 kHz sampling frequency. We trained an srnn, an LSTM,

Figure 1. Scalability tests, comparing our optimised approach with the naive one. (a) Time required for one gradient step when the input is one two-dimensional sequence of fixed length T; the size of the hidden layer n takes values in {128, 256, 512, 1024, 2048}. (b) Time required for one gradient step when the size of the hidden layer (n = 2048) and the length of the input sequence are fixed; the number of reflections m takes values in {128, 256, 512, 1024, 2048}. (c) Time required for one gradient step when the size of the hidden layer is fixed to n = 128; the input is one two-dimensional sequence of length T ∈ {64, 128, 256, 512, 1024, 2048, 4096}.

and an ornn for 5 epochs on each of the pieces with five random seeds. For each run, the lowest Normalised Mean Squared Error (NMSE) during the 5 epochs was recorded. For each model, we tested three different hidden sizes; the total number of parameters corresponding to these hidden sizes was approximately equal to 250, 500, and 1000. For the ornn, we chose the number of reflection vectors to be equal to the hidden size in each case, so that the transition matrix is allowed to span the full orthogonal group. The results are shown in Figures 2 and 3. All the learning rates were set to 0.1. The orthogonal parametrisation outperformed the simple RNN and performed on average better than the LSTM on this sequence encoding task.

PIXEL MNIST

We used the MNIST image dataset in this experiment, split into a training set (55,000 instances), a validation set (5,000 instances), and a test set (10,000 instances). We trained an ornn with n = 128 hidden units and m = 16 reflections, which corresponds to approximately 3,600 parameters, to minimise the cross-entropy. We set the batch size to 64 and selected the best model based on the lowest validation cost after 2 epochs; the latter was evaluated every 2 training iterations. All the learning rates were set to 0.1 in this experiment. Table 2 compares our test performance against the urnn results available in the literature. The models used for comparison all have about 16K parameters, which is more than four times the number of parameters of our tested model. Despite this, our model still achieved a 95.6% test accuracy, which is the second best result.

Figure 3. Results summary for the sequence generation task. The plots show the NMSE distribution for the different models with respect to the total number of parameters. The horizontal red lines represent the medians of the NMSE over 25 data points (i.e. five seeds for each of the five song excerpts). The solid rectangles and the dashed bars represent the [25%, 75%] and [9%, 91%] confidence intervals respectively.

Figure 2. The RNN-generated sequences against the true data for two of the five excerpts used. Only the best performing models with a total number of parameters N_p ≈ 1000 are displayed; this corresponds to hidden sizes of 15, 30, and 41 for the LSTM, srnn, and ornn respectively. The number of reflections was set to m = n for the ornn. The srnn showed the worst performance.

Model | hidden size (# reflections) | Number of parameters | Validation accuracy | Test accuracy
ornn | 128 (m = 16) | ≈3.6K | 96.8% | 95.6%
urnn (Arjovsky et al., 2015) | - | ≈16K | - | -
RC urnn (Jia, 2016) | - | ≈16K | - | -
FC urnn (Jia, 2016) | - | ≈16K | - | -

Table 2. Results summary for the MNIST digit classification experiment and comparison with the urnn results available in the literature. FC and RC stand for Full-Capacity and Restricted-Capacity respectively.

ADDING TASK

In this experiment, we follow a setting similar to (Arjovsky et al., 2015), where the goal of the RNN is to output the sum of two elements in the first dimension of a two-dimensional sequence. The locations of the two elements to be summed are specified by the entries in the second dimension of the input sequence. In particular, the first dimension of every input sequence consists of random numbers between 0 and 1. The second dimension is all zeros except for two elements, which are equal to 1; the first unit entry is located in the first half of the sequence, and the second one in the second half. We tested two different lags, T = 400 and T = 800. All models were trained to minimise the Mean Squared Error (MSE). The baseline MSE for this task is 0.167, corresponding to a model that always outputs 1.

We trained an ornn with n = 128 hidden units and m = 16 reflections, as in the MNIST experiment. We also trained an LSTM and an srnn with hidden sizes of 28 and 54 respectively, corresponding to a total of approximately 3,600 parameters (i.e. the same as the ornn model). We chose a batch size of 5, and after each iteration a new set of sequences was generated randomly. The learning rate for the ornn was set to 0.1. Figures 5(a) and 5(b) display the results. The ornn was able to beat the baseline MSE early in training for both lags and for two different random initialisation seeds. This is in line with the results of the unitary RNN (Arjovsky et al., 2015).

6. Discussion

In this work, we presented a new efficient way to parametrise the transition matrix of a recurrent neural network using Householder reflections. This method provides an easy and computationally efficient way to enforce an orthogonal constraint on the transition matrix, which in turn ensures that exploding gradients do not occur during training. We have also shown that enforcing a unitary constraint on the transition matrix is a special case of the orthogonal one. Experimental results show that the orthogonal parametrisation has benefits similar to the unitary one.

Acknowledgment

The authors would like to acknowledge the Department of State Growth Tasmania for funding this work through SenseT.

References

Arjovsky, Martin, Shah, Amar, and Bengio, Yoshua. Unitary evolution recurrent neural networks. arXiv preprint, 2015.

Giles, Mike B. An extended collection of matrix derivative results for forward and reverse mode automatic differentiation. 2008.

Glorot, Xavier and Bengio, Yoshua. Understanding the difficulty of training deep feedforward neural networks. In AISTATS, volume 9, 2010.

Figure 4. Convergence plots for the MNIST classification experiment. (a) Validation accuracy of the ornn with n = 128 and m = 16 as a function of the number of epochs. (b) Cross-entropy on the validation set of the ornn with n = 128 and m = 16 as a function of the number of epochs.

Figure 5. Results of the addition task for lags T = 400 (a) and T = 800 (b), showing the MSE of the LSTM, srnn, and ornn as a function of training iterations. For each lag T, the red curves represent two runs of the ornn model with two different random initialisation seeds. Notice that the LSTM and srnn did not beat the baseline MSE of 0.167.

Henaff, Mikael, Szlam, Arthur, and LeCun, Yann. Orthogonal RNNs and long-memory tasks. arXiv preprint, 2016.

Hochreiter, Sepp and Schmidhuber, Jürgen. Long short-term memory. Neural Computation, 9(8), 1997.

Hochreiter, Sepp, Bengio, Yoshua, Frasconi, Paolo, and Schmidhuber, Jürgen. Gradient flow in recurrent nets: the difficulty of learning long-term dependencies, 2001.

Hyland, Stephanie L. and Rätsch, Gunnar. Learning unitary operators with help from u(n). arXiv preprint, 2016.

Jia, Kui. Improving training of deep neural networks via singular value bounding. arXiv preprint, 2016.

Joffrain, Thierry, Low, Tze Meng, Quintana-Ortí, Enrique S., van de Geijn, Robert, and Van Zee, Field G. Accumulating Householder transformations, revisited. ACM Transactions on Mathematical Software (TOMS), 32(2), 2006.

Koutník, Jan, Greff, Klaus, Gomez, Faustino J., and Schmidhuber, Jürgen. A clockwork RNN. CoRR, 2014.

Le, Quoc V., Jaitly, Navdeep, and Hinton, Geoffrey E. A simple way to initialize recurrent networks of rectified linear units. arXiv preprint, 2015.

Pascanu, Razvan, Mikolov, Tomas, and Bengio, Yoshua. On the difficulty of training recurrent neural networks. In ICML, 2013.

Theano Development Team. Theano: A Python framework for fast computation of mathematical expressions. arXiv e-prints, May 2016.

Wisdom, Scott, Powers, Thomas, Hershey, John R., Le Roux, Jonathan, and Atlas, Les. Full-capacity unitary recurrent neural networks. arXiv preprint, 2016.

A. Proof of Theorem 2

Lemma 1. (Giles, 2008) Let A, B, and C be real or complex matrices such that C = f(A, B), where f is some differentiable mapping. Let s be a scalar quantity that depends on A and B only through C. Then we have the identity

Tr(∇C^T dC) = Tr(∇A^T dA) + Tr(∇B^T dB),

where dA, dB, and dC represent infinitesimal perturbations and ∇X = ∂s/∂X for X ∈ {A, B, C}.

Proof of Theorem 2. Let C = h - U T^{-1} U^T h, where (U, h) ∈ R^{n×m} × R^n and T = striu(U^T U) + (1/2) diag(U^T U). Notice that the matrix T can be written using the Hadamard product as T = B ∘ (U^T U), where B = striu(J) + (1/2) I and J is the matrix of all ones. Calculating the infinitesimal perturbation of C gives

dC = (I - U T^{-1} U^T) dh - dU T^{-1} U^T h - U T^{-1} dU^T h + U T^{-1} dT T^{-1} U^T h.

Using the Hadamard expression for T above, we can write dT = B ∘ (dU^T U + U^T dU). Substituting this back into the expression for dC, multiplying the left- and right-hand sides by ∇C^T, and applying the trace, we get

Tr(∇C^T dC) = Tr(∇C^T (I - U T^{-1} U^T) dh) - Tr(∇C^T dU T^{-1} U^T h) - Tr(∇C^T U T^{-1} dU^T h) + Tr(∇C^T U T^{-1} (B ∘ (dU^T U + U^T dU)) T^{-1} U^T h).

Now using the identity Tr(AB) = Tr(BA), where the second dimension of A agrees with the first dimension of B, we can rearrange the expression of Tr(∇C^T dC) as follows:

Tr(∇C^T dC) = Tr(∇C^T (I - U T^{-1} U^T) dh) - Tr(T^{-1} U^T h ∇C^T dU) - Tr(h ∇C^T U T^{-1} dU^T) + Tr(T^{-1} U^T h ∇C^T U T^{-1} (B ∘ (dU^T U + U^T dU))).

To simplify the expression, we use the short notations C̄ = T^{-T} U^T ∇C and h̄ = T^{-1} U^T h.

With this notation, Tr(∇C^T dC) becomes

Tr(∇C^T dC) = Tr((∇C^T - C̄^T U^T) dh) - Tr(h̄ ∇C^T dU) - Tr(h C̄^T dU^T) + Tr(h̄ C̄^T (B ∘ (dU^T U + U^T dU))).

Now, using the two trace identities

Tr(A^T) = Tr(A),   Tr(A (B ∘ C)) = Tr((A ∘ B^T) C),

we can rewrite Tr(∇C^T dC) as

Tr(∇C^T dC) = Tr((∇C^T - C̄^T U^T) dh) - Tr(h̄ ∇C^T dU) - Tr(h C̄^T dU^T) + Tr(((h̄ C̄^T) ∘ B^T) dU^T U) + Tr(((h̄ C̄^T) ∘ B^T) U^T dU).

By rearranging and taking the transpose of the third and fourth terms of the right-hand side, we obtain

Tr(∇C^T dC) = Tr((∇C^T - C̄^T U^T) dh) - Tr(h̄ ∇C^T dU) - Tr(C̄ h^T dU) + Tr(((C̄ h̄^T) ∘ B) U^T dU) + Tr(((h̄ C̄^T) ∘ B^T) U^T dU).

Factorising by dU inside the trace, we get

Tr(∇C^T dC) = Tr((∇C^T - C̄^T U^T) dh) - Tr([ h̄ ∇C^T + C̄ h^T - ((C̄ h̄^T) ∘ B + (h̄ C̄^T) ∘ B^T) U^T ] dU).

Using Lemma 1, we conclude that

∇U = U [ (h̄ C̄^T) ∘ B^T + (C̄ h̄^T) ∘ B ] - ∇C h̄^T - h C̄^T,
∇h = ∇C - U C̄.
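The reconstructed expressions in Equations (19) and (20) can be checked against finite differences. The following numpy sketch (ours, not part of the paper) assumes the loss is L = g^T (W h) for a fixed vector g, so that ∇C = g; ∇h can be checked the same way.

```python
import numpy as np

def loss(U, h, g):
    """L = g^T C with C = (I - U T^{-1} U^T) h and T as in Equation (17)."""
    G = U.T @ U
    T = np.triu(G, 1) + 0.5 * np.diag(np.diag(G))
    return g @ (h - U @ np.linalg.solve(T, U.T @ h))

def theorem2_grads(U, h, g):
    """Closed-form gradients of Equations (19) and (20), with grad_C = g."""
    m = U.shape[1]
    G = U.T @ U
    T = np.triu(G, 1) + 0.5 * np.diag(np.diag(G))
    B = np.triu(np.ones((m, m)), 1) + 0.5 * np.eye(m)
    h_bar = np.linalg.solve(T, U.T @ h)        # T^{-1} U^T h
    c_bar = np.linalg.solve(T.T, U.T @ g)      # T^{-T} U^T grad_C
    grad_U = (U @ (np.outer(h_bar, c_bar) * B.T + np.outer(c_bar, h_bar) * B)
              - np.outer(g, h_bar) - np.outer(h, c_bar))
    grad_h = g - U @ c_bar
    return grad_U, grad_h

rng = np.random.default_rng(0)
n, m = 7, 4
U, h, g = rng.normal(size=(n, m)), rng.normal(size=n), rng.normal(size=n)
grad_U, grad_h = theorem2_grads(U, h, g)

eps = 1e-6
num_U = np.zeros_like(U)
for i in range(n):
    for j in range(m):
        E = np.zeros_like(U); E[i, j] = eps
        num_U[i, j] = (loss(U + E, h, g) - loss(U - E, h, g)) / (2 * eps)
print(np.max(np.abs(grad_U - num_U)))   # close to zero: agrees with Equation (19)
```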


More information

arxiv: v2 [cs.ne] 7 Apr 2015

arxiv: v2 [cs.ne] 7 Apr 2015 A Simple Way to Initialize Recurrent Networks of Rectified Linear Units arxiv:154.941v2 [cs.ne] 7 Apr 215 Quoc V. Le, Navdeep Jaitly, Geoffrey E. Hinton Google Abstract Learning long term dependencies

More information

Lecture 15: Exploding and Vanishing Gradients

Lecture 15: Exploding and Vanishing Gradients Lecture 15: Exploding and Vanishing Gradients Roger Grosse 1 Introduction Last lecture, we introduced RNNs and saw how to derive the gradients using backprop through time. In principle, this lets us train

More information

arxiv: v1 [stat.ml] 18 Nov 2017

arxiv: v1 [stat.ml] 18 Nov 2017 MinimalRNN: Toward More Interpretable and Trainable Recurrent Neural Networks arxiv:1711.06788v1 [stat.ml] 18 Nov 2017 Minmin Chen Google Mountain view, CA 94043 minminc@google.com Abstract We introduce

More information

Large-Scale Feature Learning with Spike-and-Slab Sparse Coding

Large-Scale Feature Learning with Spike-and-Slab Sparse Coding Large-Scale Feature Learning with Spike-and-Slab Sparse Coding Ian J. Goodfellow, Aaron Courville, Yoshua Bengio ICML 2012 Presented by Xin Yuan January 17, 2013 1 Outline Contributions Spike-and-Slab

More information

SAMPLE OF THE STUDY MATERIAL PART OF CHAPTER 1 Introduction to Linear Algebra

SAMPLE OF THE STUDY MATERIAL PART OF CHAPTER 1 Introduction to Linear Algebra SAMPLE OF THE STUDY MATERIAL PART OF CHAPTER 1 Introduction to 1.1. Introduction Linear algebra is a specific branch of mathematics dealing with the study of vectors, vector spaces with functions that

More information

KRONECKER RECURRENT UNITS

KRONECKER RECURRENT UNITS KRONECKER RECURRENT UNITS Cijo Jose Idiap Research Institute & EPFL cijo.jose@idiap.ch Moustapha Cissé Facebook AI Research moustaphacisse@fb.com François Fleuret Idiap Research Institute & EPFL francois.fleuret@idiap.ch

More information

Deep Learning book, by Ian Goodfellow, Yoshua Bengio and Aaron Courville

Deep Learning book, by Ian Goodfellow, Yoshua Bengio and Aaron Courville Deep Learning book, by Ian Goodfellow, Yoshua Bengio and Aaron Courville Chapter 6 :Deep Feedforward Networks Benoit Massé Dionyssos Kounades-Bastian Benoit Massé, Dionyssos Kounades-Bastian Deep Feedforward

More information

Feedforward Neural Networks. Michael Collins, Columbia University

Feedforward Neural Networks. Michael Collins, Columbia University Feedforward Neural Networks Michael Collins, Columbia University Recap: Log-linear Models A log-linear model takes the following form: p(y x; v) = exp (v f(x, y)) y Y exp (v f(x, y )) f(x, y) is the representation

More information

On the use of Long-Short Term Memory neural networks for time series prediction

On the use of Long-Short Term Memory neural networks for time series prediction On the use of Long-Short Term Memory neural networks for time series prediction Pilar Gómez-Gil National Institute of Astrophysics, Optics and Electronics ccc.inaoep.mx/~pgomez In collaboration with: J.

More information

Lecture 5: Logistic Regression. Neural Networks

Lecture 5: Logistic Regression. Neural Networks Lecture 5: Logistic Regression. Neural Networks Logistic regression Comparison with generative models Feed-forward neural networks Backpropagation Tricks for training neural networks COMP-652, Lecture

More information

ECE521 Lecture 7/8. Logistic Regression

ECE521 Lecture 7/8. Logistic Regression ECE521 Lecture 7/8 Logistic Regression Outline Logistic regression (Continue) A single neuron Learning neural networks Multi-class classification 2 Logistic regression The output of a logistic regression

More information

Sajid Anwar, Kyuyeon Hwang and Wonyong Sung

Sajid Anwar, Kyuyeon Hwang and Wonyong Sung Sajid Anwar, Kyuyeon Hwang and Wonyong Sung Department of Electrical and Computer Engineering Seoul National University Seoul, 08826 Korea Email: sajid@dsp.snu.ac.kr, khwang@dsp.snu.ac.kr, wysung@snu.ac.kr

More information

Learning Deep Architectures for AI. Part II - Vijay Chakilam

Learning Deep Architectures for AI. Part II - Vijay Chakilam Learning Deep Architectures for AI - Yoshua Bengio Part II - Vijay Chakilam Limitations of Perceptron x1 W, b 0,1 1,1 y x2 weight plane output =1 output =0 There is no value for W and b such that the model

More information

arxiv: v1 [cs.lg] 11 May 2015

arxiv: v1 [cs.lg] 11 May 2015 Improving neural networks with bunches of neurons modeled by Kumaraswamy units: Preliminary study Jakub M. Tomczak JAKUB.TOMCZAK@PWR.EDU.PL Wrocław University of Technology, wybrzeże Wyspiańskiego 7, 5-37,

More information

Cheng Soon Ong & Christian Walder. Canberra February June 2018

Cheng Soon Ong & Christian Walder. Canberra February June 2018 Cheng Soon Ong & Christian Walder Research Group and College of Engineering and Computer Science Canberra February June 2018 Outlines Overview Introduction Linear Algebra Probability Linear Regression

More information

CSC321 Lecture 10 Training RNNs

CSC321 Lecture 10 Training RNNs CSC321 Lecture 10 Training RNNs Roger Grosse and Nitish Srivastava February 23, 2015 Roger Grosse and Nitish Srivastava CSC321 Lecture 10 Training RNNs February 23, 2015 1 / 18 Overview Last time, we saw

More information

Eve: A Gradient Based Optimization Method with Locally and Globally Adaptive Learning Rates

Eve: A Gradient Based Optimization Method with Locally and Globally Adaptive Learning Rates Eve: A Gradient Based Optimization Method with Locally and Globally Adaptive Learning Rates Hiroaki Hayashi 1,* Jayanth Koushik 1,* Graham Neubig 1 arxiv:1611.01505v3 [cs.lg] 11 Jun 2018 Abstract Adaptive

More information

Lecture 7 Artificial neural networks: Supervised learning

Lecture 7 Artificial neural networks: Supervised learning Lecture 7 Artificial neural networks: Supervised learning Introduction, or how the brain works The neuron as a simple computing element The perceptron Multilayer neural networks Accelerated learning in

More information

Neural Networks with Applications to Vision and Language. Feedforward Networks. Marco Kuhlmann

Neural Networks with Applications to Vision and Language. Feedforward Networks. Marco Kuhlmann Neural Networks with Applications to Vision and Language Feedforward Networks Marco Kuhlmann Feedforward networks Linear separability x 2 x 2 0 1 0 1 0 0 x 1 1 0 x 1 linearly separable not linearly separable

More information

SAMPLE OF THE STUDY MATERIAL PART OF CHAPTER 1 Introduction to Linear Algebra

SAMPLE OF THE STUDY MATERIAL PART OF CHAPTER 1 Introduction to Linear Algebra 1.1. Introduction SAMPLE OF THE STUDY MATERIAL PART OF CHAPTER 1 Introduction to Linear algebra is a specific branch of mathematics dealing with the study of vectors, vector spaces with functions that

More information

ECE521 Lectures 9 Fully Connected Neural Networks

ECE521 Lectures 9 Fully Connected Neural Networks ECE521 Lectures 9 Fully Connected Neural Networks Outline Multi-class classification Learning multi-layer neural networks 2 Measuring distance in probability space We learnt that the squared L2 distance

More information

Lecture 6. Notes on Linear Algebra. Perceptron

Lecture 6. Notes on Linear Algebra. Perceptron Lecture 6. Notes on Linear Algebra. Perceptron COMP90051 Statistical Machine Learning Semester 2, 2017 Lecturer: Andrey Kan Copyright: University of Melbourne This lecture Notes on linear algebra Vectors

More information

arxiv: v1 [cs.lg] 2 Feb 2018

arxiv: v1 [cs.lg] 2 Feb 2018 Short-term Memory of Deep RNN Claudio Gallicchio arxiv:1802.00748v1 [cs.lg] 2 Feb 2018 Department of Computer Science, University of Pisa Largo Bruno Pontecorvo 3-56127 Pisa, Italy Abstract. The extension

More information

Feed-forward Network Functions

Feed-forward Network Functions Feed-forward Network Functions Sargur Srihari Topics 1. Extension of linear models 2. Feed-forward Network Functions 3. Weight-space symmetries 2 Recap of Linear Models Linear Models for Regression, Classification

More information

Recurrent Residual Learning for Sequence Classification

Recurrent Residual Learning for Sequence Classification Recurrent Residual Learning for Sequence Classification Yiren Wang University of Illinois at Urbana-Champaign yiren@illinois.edu Fei ian Microsoft Research fetia@microsoft.com Abstract In this paper, we

More information

Mathematics. EC / EE / IN / ME / CE. for

Mathematics.   EC / EE / IN / ME / CE. for Mathematics for EC / EE / IN / ME / CE By www.thegateacademy.com Syllabus Syllabus for Mathematics Linear Algebra: Matrix Algebra, Systems of Linear Equations, Eigenvalues and Eigenvectors. Probability

More information

Implicitly and Explicitly Constrained Optimization Problems for Training of Recurrent Neural Networks

Implicitly and Explicitly Constrained Optimization Problems for Training of Recurrent Neural Networks Implicitly and Explicitly Constrained Optimization Problems for Training of Recurrent Neural Networks Carl-Johan Thore Linköping University - Division of Mechanics 581 83 Linköping - Sweden Abstract. Training

More information

Deep Linear Networks with Arbitrary Loss: All Local Minima Are Global

Deep Linear Networks with Arbitrary Loss: All Local Minima Are Global homas Laurent * 1 James H. von Brecht * 2 Abstract We consider deep linear networks with arbitrary convex differentiable loss. We provide a short and elementary proof of the fact that all local minima

More information

Mark Gales October y (x) x 1. x 2 y (x) Inputs. Outputs. x d. y (x) Second Output layer layer. layer.

Mark Gales October y (x) x 1. x 2 y (x) Inputs. Outputs. x d. y (x) Second Output layer layer. layer. University of Cambridge Engineering Part IIB & EIST Part II Paper I0: Advanced Pattern Processing Handouts 4 & 5: Multi-Layer Perceptron: Introduction and Training x y (x) Inputs x 2 y (x) 2 Outputs x

More information

NEURAL LANGUAGE MODELS

NEURAL LANGUAGE MODELS COMP90042 LECTURE 14 NEURAL LANGUAGE MODELS LANGUAGE MODELS Assign a probability to a sequence of words Framed as sliding a window over the sentence, predicting each word from finite context to left E.g.,

More information

CS 6501: Deep Learning for Computer Graphics. Basics of Neural Networks. Connelly Barnes

CS 6501: Deep Learning for Computer Graphics. Basics of Neural Networks. Connelly Barnes CS 6501: Deep Learning for Computer Graphics Basics of Neural Networks Connelly Barnes Overview Simple neural networks Perceptron Feedforward neural networks Multilayer perceptron and properties Autoencoders

More information

Lecture 17: Neural Networks and Deep Learning

Lecture 17: Neural Networks and Deep Learning UVA CS 6316 / CS 4501-004 Machine Learning Fall 2016 Lecture 17: Neural Networks and Deep Learning Jack Lanchantin Dr. Yanjun Qi 1 Neurons 1-Layer Neural Network Multi-layer Neural Network Loss Functions

More information

STA 414/2104: Lecture 8

STA 414/2104: Lecture 8 STA 414/2104: Lecture 8 6-7 March 2017: Continuous Latent Variable Models, Neural networks Delivered by Mark Ebden With thanks to Russ Salakhutdinov, Jimmy Ba and others Outline Continuous latent variable

More information

Neural Networks 2. 2 Receptive fields and dealing with image inputs

Neural Networks 2. 2 Receptive fields and dealing with image inputs CS 446 Machine Learning Fall 2016 Oct 04, 2016 Neural Networks 2 Professor: Dan Roth Scribe: C. Cheng, C. Cervantes Overview Convolutional Neural Networks Recurrent Neural Networks 1 Introduction There

More information

Convolutional LSTM Network: A Machine Learning Approach for Precipitation Nowcasting 卷积 LSTM 网络 : 利用机器学习预测短期降雨 施行健 香港科技大学 VALSE 2016/03/23

Convolutional LSTM Network: A Machine Learning Approach for Precipitation Nowcasting 卷积 LSTM 网络 : 利用机器学习预测短期降雨 施行健 香港科技大学 VALSE 2016/03/23 Convolutional LSTM Network: A Machine Learning Approach for Precipitation Nowcasting 卷积 LSTM 网络 : 利用机器学习预测短期降雨 施行健 香港科技大学 VALSE 2016/03/23 Content Quick Review of Recurrent Neural Network Introduction

More information

Equivalence Results between Feedforward and Recurrent Neural Networks for Sequences

Equivalence Results between Feedforward and Recurrent Neural Networks for Sequences Equivalence Results between Feedforward and Recurrent Neural Networks for Sequences Alessandro Sperduti Department of Mathematics University of Padova, Italy Email: sperduti@mathunipdit Abstract In the

More information

Accelerating SVRG via second-order information

Accelerating SVRG via second-order information Accelerating via second-order information Ritesh Kolte Department of Electrical Engineering rkolte@stanford.edu Murat Erdogdu Department of Statistics erdogdu@stanford.edu Ayfer Özgür Department of Electrical

More information

Deep Belief Networks are compact universal approximators

Deep Belief Networks are compact universal approximators 1 Deep Belief Networks are compact universal approximators Nicolas Le Roux 1, Yoshua Bengio 2 1 Microsoft Research Cambridge 2 University of Montreal Keywords: Deep Belief Networks, Universal Approximation

More information

Neural Networks Learning the network: Backprop , Fall 2018 Lecture 4

Neural Networks Learning the network: Backprop , Fall 2018 Lecture 4 Neural Networks Learning the network: Backprop 11-785, Fall 2018 Lecture 4 1 Recap: The MLP can represent any function The MLP can be constructed to represent anything But how do we construct it? 2 Recap:

More information