Efficient Orthogonal Parametrisation of Recurrent Neural Networks Using Householder Reflections


Zakaria Mhammedi, The Department of Computing and Information Systems, The University of Melbourne, Parkville VIC 3010, Australia (ZMHAMMEDI@STUDENT.UNIMELB.EDU.AU)
Andrew Hellicar, Data61, CSIRO, Sandy Bay, TAS, Australia (ANDREW.HELLICAR@DATA61.CSIRO.AU)
Ashfaqur Rahman, Data61, CSIRO, Sandy Bay, TAS, Australia (ASHFAQUR.RAHMAN@DATA61.CSIRO.AU)
James Bailey, The Department of Computing and Information Systems, The University of Melbourne, Parkville VIC 3010, Australia (JAMES.BAILEY@UNIMELB.EDU.AU)

arXiv preprint, v1 [cs.LG], 1 Dec 2016

Abstract

Recurrent Neural Networks (RNNs) have been successfully used in many applications. However, the problem of learning long-term dependencies in sequences using these networks is still a major challenge. Recent methods have been suggested to solve this problem by constraining the transition matrix to be unitary during training, which ensures that its norm is exactly equal to one. These methods either have limited expressiveness or scale poorly with the size of the network when compared to the simple RNN case, especially in an online learning setting. Our contributions are as follows. We first show that constraining the transition matrix to be unitary is a special case of an orthogonal constraint. Therefore, it may not be necessary to work with complex-valued matrices. Then we present a new parametrisation of the transition matrix which allows efficient training of an RNN while ensuring that the transition matrix is always orthogonal. Using our approach, one online gradient step can, in the worst case, be performed in time complexity O(Tn^2), where T and n are the length of the input sequence and the size of the hidden layer respectively. This time complexity is the same as for the simple RNN. Finally, we test our new parametrisation on problems with long-term dependencies. Our results suggest that the orthogonal constraint on the transition matrix has similar benefits to the unitary constraint.

1. Introduction

Recurrent Neural Networks (RNNs) have been successfully used in many applications involving time series. This is because RNNs are well suited for sequential data: they process inputs one element at a time and store relevant information in their hidden state. In practice, however, training simple RNNs can be challenging due to the problem of exploding and vanishing gradients (Hochreiter et al., 2001). It has been shown that the exploding gradient problem is likely to occur when the transition matrix of an RNN has a spectral norm larger than one (Glorot & Bengio, 2010). This results in an error surface, associated with some objective function, having very steep walls (Pascanu et al., 2013). Landing on one of these walls during training results in a very large gradient step, which can disrupt training. On the other hand, when the spectral norm of the transition matrix is less than one, the information at one time step tends to vanish after a few time steps. This makes it challenging to learn long-term dependencies in sequential data.
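To make the role of the spectral norm concrete, the following small numpy illustration (ours, not from the paper) tracks how the purely linear part of the recurrence rescales a unit perturbation of the hidden state over many time steps, for transition matrices with spectral norm just below, exactly at, and just above one; the non-linearity and the inputs are ignored here.

```python
import numpy as np

rng = np.random.default_rng(0)
n, T = 64, 100
Q, _ = np.linalg.qr(rng.normal(size=(n, n)))     # orthogonal matrix: spectral norm exactly 1

for scale in (0.95, 1.0, 1.05):
    W = scale * Q                                 # spectral norm = scale
    delta = rng.normal(size=n)
    delta /= np.linalg.norm(delta)                # unit perturbation of the hidden state
    for _ in range(T):
        delta = W @ delta                         # linearised propagation through T steps
    # roughly 0.95^100 (vanishes), 1 (preserved), 1.05^100 (explodes)
    print(scale, np.linalg.norm(delta))
```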

Different methods have been suggested to solve either the vanishing or the exploding gradient problem. The LSTM was specifically designed to help with the vanishing gradient (Hochreiter & Schmidhuber, 1997). This is achieved by using gate vectors which allow a linear flow of information through the hidden state. However, the LSTM does not directly address the exploding gradient problem. In fact, the longer the input sequence is, the more prone the LSTM is to exploding gradients.

One approach to solving this issue is to clip the gradients (Pascanu et al., 2013) when their norm exceeds some threshold value. This has the negative side effect of adding an additional hyperparameter to the model. Furthermore, if the exploding gradient occurs within some region of the parameter search space, the associated error surface still has steep walls. This makes training challenging even with gradient clipping.

Another way to approach this problem is to improve the shape of the error surface directly by making it smoother, which can be achieved by constraining the spectral norm of the transition matrix to be less than or equal to one. However, a value of exactly one is best for the vanishing gradient problem. A good choice of the activation function between hidden states is also crucial in this case. These ideas have been investigated in recent work. In particular, the unitary RNN (Arjovsky et al., 2015) constrains the transition matrix to be in the unitary group U(n) using a special parametrisation. This ensures that the norm of the transition matrix is exactly equal to one. This parametrisation and other similar ones (Hyland & Rätsch, 2016; Wisdom et al., 2016) have advantages and drawbacks which we discuss in more detail in the next section.

The main contributions of this work are as follows:

- We first show that constraining the search space of the transition matrix of an RNN to the unitary group U(n) is equivalent to limiting the search space to a subset of the orthogonal group O(2n) of a new RNN with twice the hidden size. This suggests that it may not be necessary to work with complex-valued matrices.
- We present a simple and efficient way to parametrise orthogonal transition matrices. We show that the time and space complexities of using this parametrisation are, in the worst case, the same as those of the simple RNN. This new parametrisation can be used in other types of neural networks, not just recurrent ones.

2. Related Work

Throughout this work we refer to elements of the following simple RNN architecture:

h_t = φ(W h_{t-1} + V x_t),   (1)
o_t = Y h_t,   (2)

where W, V, and Y are the hidden-to-hidden, input-to-hidden, and hidden-to-output weight matrices, h_{t-1} and h_t are the hidden vectors at time steps t-1 and t respectively, and φ is a non-linear activation function. We have omitted the bias terms for simplicity.

Regarding the structure of the transition matrix W, recent research has explored how its initialisation influences training and the ability to learn long-term dependencies. In particular, initialising with specific structures such as the identity or an orthogonal matrix can greatly improve performance (Le et al., 2015). In addition to these initialisation methods, one study has also considered removing the non-linearity between the hidden-to-hidden connections (Henaff et al., 2016), i.e. the term W h_{t-1} in Equation (1) is placed outside the activation function φ. This method showed good results compared to the LSTM on pathological synthetic problems exhibiting long-term dependencies.

After training a model for a few iterations using gradient descent, however, nothing guarantees that the initial structure of the transition matrix will be kept. In fact, its spectral norm can deviate from one, and exploding and vanishing gradients can become a problem again.
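For reference, Equations (1) and (2) correspond to the following forward pass. This is a minimal numpy sketch for illustration, with tanh standing in for φ; it is not the implementation used in the paper.

```python
import numpy as np

def simple_rnn_forward(W, V, Y, xs, h0):
    """Simple RNN of Equations (1)-(2): h_t = phi(W h_{t-1} + V x_t), o_t = Y h_t."""
    h, outputs = h0, []
    for x in xs:                      # xs is a sequence of input vectors
        h = np.tanh(W @ h + V @ x)    # hidden-state update (phi = tanh here)
        outputs.append(Y @ h)         # hidden-to-output projection
    return outputs, h
```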
It is possible to constrain the transition matrix to be orthogonal during training using more sophisticated methods (Wisdom et al., 2016). However, this comes at the cost of time complexity. A naive approach is to use gradient descent and project the updated transition matrix onto the orthogonal group, using the QR decomposition for example. This incurs an O(n^3) cost at each iteration, where n is the size of the hidden layer, which can be an issue when the batch size and the sequence length T are small compared to n.

Recently, new parametrisations of the transition matrix have been suggested (Arjovsky et al., 2015) which ensure that its spectral norm is always equal to one. This condition is sufficient to ensure that exploding gradients do not occur, and methods such as gradient clipping become unnecessary. The unitary RNN (urnn) is one example, where the hidden matrix W ∈ C^{n×n} is the product of elementary matrices consisting of reflection, diagonal, and Fourier transform matrices. When the size of the hidden layer is equal to n, the transition matrix has a total of only 7n parameters. Another advantage of this parametrisation is computational efficiency: the matrix-vector product W v, for some vector v, can be calculated in O(n log n) time. However, it has been shown that this parametrisation does not allow the transition matrix to span the full unitary group (Wisdom et al., 2016), which may limit the expressiveness of the model.

Another interesting parametrisation (Hyland & Rätsch, 2016) takes advantage of the algebraic properties of the unitary group U(n). The idea is to use the corresponding matrix Lie algebra u(n) of skew-Hermitian matrices.

In particular, the transition matrix can be written as

W = exp( Σ_{i=1}^{n^2} λ_i T_i ),   (3)

where exp is the matrix exponential and {T_i}_{i=1}^{n^2} are predefined n×n matrices forming a basis of the Lie algebra u(n). The learnable parameters are the weights {λ_i}. The fact that the matrix Lie algebra u(n) is closed and connected ensures that the exponential mapping from u(n) to U(n) is surjective. Therefore, with this parametrisation the search space of the transition matrix spans the whole unitary group. This is one advantage over the original unitary parametrisation (Arjovsky et al., 2015). However, the cost of computing the matrix exponential in Equation (3) is O(n^3), where n is the dimension of the hidden state, and evaluating Σ_{i=1}^{n^2} λ_i T_i can also be computationally expensive.

A more recent method (Wisdom et al., 2016) performs optimisation directly on the Stiefel manifold; the corresponding model was called the full-capacity unitary RNN. Using this approach, the transition matrix can span the full unitary group. However, the method involves a matrix inverse as well as matrix-matrix products, which have time complexity O(n^3). This can be problematic when the size of the hidden layer is large. This method is somewhat similar to the projection approach discussed earlier.

All the methods discussed above, except for the original unitary RNN, involve a step that requires at least O(n^3) time. Furthermore, they all require the use of complex matrices. Table 1 summarises the time complexities of various methods, including our approach, for one online gradient step. In the next section, we show that imposing a unitary constraint on a transition matrix W ∈ C^{n×n} is equivalent to imposing a constraint in a subset of the orthogonal group O(2n) on a new RNN with twice the hidden size. Furthermore, since the norm of an orthogonal matrix is also always equal to one, using the latter has the same theoretical advantages as using unitary matrices when it comes to the exploding gradient problem.

3. Complex unitary versus orthogonal

It is possible to show that when the transition matrix W ∈ C^{n×n} of an RNN is unitary, there exists an equivalent representation of this RNN with an orthogonal transition matrix Ŵ ∈ R^{2n×2n}. Consider a complex unitary transition matrix W = A + iB ∈ C^{n×n}, where A and B are real-valued matrices in R^{n×n}. Furthermore, consider the following new variables:

ĥ_t = [Re(h_t); Im(h_t)],   V̂ = [Re(V); Im(V)],   Ŵ = [[A, -B], [B, A]],

where Re and Im denote the real and imaginary parts of a complex number. Note that ĥ_t ∈ R^{2n}, Ŵ ∈ R^{2n×2n}, and V̂ ∈ R^{2n×n_x}, where n_x is the dimension of the input vector x_t in Equation (1). Assuming that the activation function φ applies to the real and imaginary parts separately, it is easy to show that the update equation of the complex hidden state h_t of the unitary RNN can be written in the real space as

ĥ_t = φ(Ŵ ĥ_{t-1} + V̂ x_t).   (4)

Even when the activation function φ does not apply to the real and imaginary parts separately, it is still possible to find an equivalent real-space representation. Take for example the activation function proposed by Arjovsky et al. (2015),

σ_modReLU(z) = (z/|z|)(|z| + b)  if |z| + b > 0,  and  0  otherwise,   (5)

where b is a bias. For a hidden state ĥ ∈ R^{2n}, the equivalent activation function in the real-space representation (4) is given by

φ(ĥ)_i = (ĥ_i / sqrt(ĥ_i^2 + ĥ_{k_i}^2)) (sqrt(ĥ_i^2 + ĥ_{k_i}^2) + b_{k̄_i})  if sqrt(ĥ_i^2 + ĥ_{k_i}^2) + b_{k̄_i} > 0,  and  0  otherwise,   (6)

where k_i = (i + n) mod 2n and k̄_i = i mod n for all i ∈ {1, ..., 2n}. This activation is no longer applied to the hidden units independently.
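The equivalence between Equations (5) and (6) can be checked numerically. Below is a small numpy sketch we add for illustration (0-based indexing and a small guard against division by zero); it is not code from the paper.

```python
import numpy as np

def modrelu(z, b):
    """modReLU of Equation (5) applied to a complex vector z with bias vector b."""
    r = np.abs(z)
    scale = (r + b) / np.maximum(r, 1e-12)          # (|z| + b)/|z|, guarded
    return np.where(r + b > 0, scale * z, 0.0)

def modrelu_real(h_hat, b):
    """Equivalent real-space activation of Equation (6); h_hat = [Re(h); Im(h)]."""
    n = h_hat.size // 2
    partner = np.roll(h_hat, -n)                    # h_hat[(i + n) mod 2n]
    r = np.sqrt(h_hat**2 + partner**2)              # modulus shared by the two halves
    bias = np.concatenate([b, b])                   # b[i mod n]
    scale = (r + bias) / np.maximum(r, 1e-12)
    return np.where(r + bias > 0, scale * h_hat, 0.0)

rng = np.random.default_rng(0)
n = 5
z = rng.normal(size=n) + 1j * rng.normal(size=n)
b = rng.normal(size=n)
out = modrelu(z, b)
print(np.allclose(modrelu_real(np.concatenate([z.real, z.imag]), b),
                  np.concatenate([out.real, out.imag])))    # True
```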
Now we show that the matrix Ŵ is orthogonal. By definition of a unitary matrix, we have W W* = I, where * denotes the conjugate transpose. This implies

A A^T + B B^T = I,   (7)
B A^T - A B^T = 0.   (8)

On the other hand, we have

Ŵ Ŵ^T = [[A A^T + B B^T, A B^T - B A^T], [B A^T - A B^T, B B^T + A A^T]].   (9)

Using Equations (7) and (8), it follows that Ŵ Ŵ^T = I. Also note that the matrix Ŵ has a special structure: it is a block matrix whose blocks are A, B, -B, and A.

The discussion above shows that using a complex unitary matrix in C^{n×n} is equivalent to using an orthogonal matrix, belonging to a subset of O(2n), in a new RNN with twice the hidden size. This is why in this work we focus on parametrising orthogonal matrices directly.
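This block structure is easy to verify numerically. The following numpy sketch (ours, for illustration) builds a random unitary matrix via a complex QR decomposition and checks that its real 2n×2n embedding is orthogonal.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 6
# random unitary W = A + iB, obtained from the QR decomposition of a complex Gaussian matrix
Z = rng.normal(size=(n, n)) + 1j * rng.normal(size=(n, n))
W, _ = np.linalg.qr(Z)
A, B = W.real, W.imag

W_hat = np.block([[A, -B], [B, A]])                   # real 2n x 2n embedding
print(np.allclose(W_hat @ W_hat.T, np.eye(2 * n)))    # True: W_hat is orthogonal
```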

Method | Constraint on the transition matrix | Time complexity of one online gradient step | Search space of the transition matrix
urnn (Arjovsky et al., 2015) | ||W|| = 1 | O(T n log n) | A subset of U(n)
Full-capacity urnn (Wisdom et al., 2016) | ||W|| = 1 | O(T n^2 + n^3) | The full U(n) group
Full-capacity urnn (Hyland & Rätsch, 2016) | ||W|| = 1 | O(T n^2 + n^3) | The full U(n) group
ornn (our approach), with m ≤ n | ||W|| = 1 | O(T n m) | The full O(n) group when m = n

Table 1. Time complexities associated with one gradient step for different methods, when the size of the hidden layer of an RNN is n and the input is one sequence of length T.

In the next section, we present an efficient and flexible parametrisation of the transition matrix which allows it to span the full orthogonal group.

4. Parametrisation of the transition matrix

Before discussing the details of our parametrisation, we need to introduce some notation. For n ∈ N and 2 ≤ k ≤ n, we define the mappings

H_k : R^k → R^{n×n},   u ↦ [[I_{n-k}, 0], [0, I_k - 2 u u^T]],   (10)

where I_k denotes the k-dimensional identity matrix. Note that when ||u|| = 1, H_k(u) is simply a Householder reflection in R^{n×n}, which is orthogonal. For the special case k = 1, we define

H_1 : R → R^{n×n},   u ↦ [[I_{n-1}, 0], [0, u]],   (11)

where u is simply a scalar in this case. Note that H_1(u) is orthogonal when u ∈ {-1, 1}.

We propose to parametrise the transition matrix of an RNN using the vectors {u_i}, which we refer to as the reflection vectors in what follows. This parametrisation has many advantages, as the following theorem helps to show.

Theorem 1. Let n ∈ N and 1 ≤ k ≤ n. Define the mapping

M_k : R^k × ... × R^n → R^{n×n},   (u_k, ..., u_n) ↦ H_n(u_n) ... H_k(u_k),

where the functions {H_k}_{k=1}^n are defined in Equations (10) and (11). Then:

- The mappings M_k are smooth for all 1 ≤ k ≤ n.
- The image of M_1 includes the orthogonal group O(n), i.e. O(n) ⊆ M_1[R × R^2 × ... × R^n].

Sketch of the proof. The first point of Theorem 1 is straightforward: the mapping M_k is the product of n - k + 1 smooth maps, and is therefore also smooth. For the second point, we need to show that for every Q ∈ O(n) there exists a tuple of vectors (u_1, ..., u_n) ∈ R × R^2 × ... × R^n such that Q = M_1(u_1, ..., u_n). Algorithm 1 shows how a QR decomposition can be performed using the matrices {H_k(u_k)}_{k=1}^n while ensuring that the upper triangular matrix R has positive diagonal elements. If we apply this algorithm to an orthogonal matrix Q, we obtain a tuple (u_1, ..., u_n) which satisfies H_n(u_n) ... H_1(u_1) R = Q. The matrix R must be orthogonal, since R = (H_n(u_n) ... H_1(u_1))^T Q. Therefore R = I, since the only upper triangular orthogonal matrix with positive diagonal elements is the identity. Hence, M_1(u_1, ..., u_n) = H_n(u_n) ... H_1(u_1) = Q.

When using m reflection vectors, the parametrisation can be expressed as

W = M_{n-m+1}(u_{n-m+1}, ..., u_n) = H_n(u_n) ... H_{n-m+1}(u_{n-m+1}).   (12)

This parametrisation has the following advantages:

- The parametrisation is smooth, which is convenient for training with gradient descent.

- When m = n, the transition matrix can span the whole orthogonal group O(n).
- The only requirement for the matrix W to be orthogonal is that the vectors {u_i} have norm one. One way to ensure that this is always the case is to project the vectors onto the unit sphere after each gradient step, which takes O(nm) time.
- The time and space complexities involved in one gradient step are, in the worst case, the same as those of a simple RNN with the same number of hidden units. This is mainly because the transition matrix W never needs to be generated explicitly: only the matrix-vector products W v are needed, and these are cheaper to compute using our parametrisation.
- The parametrisation is flexible. A good trade-off between expressiveness and speed can be found by selecting the number of reflection vectors. When the goal is to ensure that W can span the whole orthogonal group, the number of reflection vectors must equal the number of hidden units. Even in this case, the number of redundant parameters associated with W is small: when m = n, the total number of parameters needed for W is n(n+1)/2, of which only n are redundant, since the orthogonal group O(n) is an n(n-1)/2-dimensional manifold.

Note that H_1(u_1) is not the standard Householder reflection. The second point of Theorem 1 (i.e. the surjectivity onto O(n)) would not hold had we used a standard Householder reflection instead.

Algorithm 1  QR decomposition using the mappings {H_k}_{k=1}^n. For a matrix R ∈ R^{n×n}, R_{k,k} denotes its k-th diagonal element, and R_{k..n,k} ∈ R^{n-k+1} denotes the part of the k-th column from the k-th row down (including the element in the k-th row).

Require: A ∈ R^{n×n}, a full-rank matrix.
Ensure: Q and R such that A = QR, where Q = H_n(u_n) ... H_1(u_1) and R is upper triangular with positive diagonal elements.
  R ← A
  Q ← I   {initialise Q to the identity matrix}
  for k = 1 to n-1 do
    if R_{k,k} == ||R_{k..n,k}|| then
      u_{n-k+1} ← [0, ..., 0, 1]^T ∈ R^{n-k+1}
    else
      u_{n-k+1} ← R_{k..n,k} - ||R_{k..n,k}|| [1, 0, ..., 0]^T
      u_{n-k+1} ← u_{n-k+1} / ||u_{n-k+1}||
    end if
    R ← H_{n-k+1}(u_{n-k+1}) R
    Q ← Q H_{n-k+1}(u_{n-k+1})
  end for
  u_1 ← sgn(R_{n,n})
  R ← H_1(u_1) R
  Q ← Q H_1(u_1)

4.1. Time complexity

When using the parametrisation in (12), we do not need to compute the matrix W explicitly; we only need the matrix-vector products W v. The cost of calculating W v is O(nm), where m and n are the number of reflection vectors and the size of the hidden state respectively. Consequently, given a sequence of length T, the time complexity of Forward Propagation (FP) is O(Tmn). Back Propagation Through Time (BPTT) can be achieved within the same time complexity as FP. Note that for a simple RNN the complexity of both FP and BPTT is O(Tn^2) for one sequence of length T. Therefore, training an RNN using our parametrisation incurs, in the worst case, the same time complexity as the simple RNN; the worst case is when the number of reflection vectors equals the size of the hidden state.

4.2. Space complexity

Parametrising the transition matrix of an RNN using (12) with m reflections is equivalent to adding m - 1 intermediate hidden states between any two consecutive time steps. For time step t, these vectors are

h_{t,k} = H_{n-m+k}(u_{n-m+k}) ... H_{n-m+1}(u_{n-m+1}) h_t,   (13)

for 1 ≤ k ≤ m - 1, with the convention h_{t,m} = h_{t+1,0} = h_{t+1}.
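Since W is never formed explicitly, the product W h in Equation (12) is computed by applying the reflections one after another, and the intermediate results are exactly the states of Equation (13). The following numpy sketch is our illustration of this (it assumes the reflection vectors are zero-padded to length n, stored as the columns of U in the order (u_n, ..., u_{n-m+1}), and have unit norm); it is not the paper's implementation.

```python
import numpy as np

def apply_W(U, h):
    """Compute W h for W = H_n(u_n) ... H_{n-m+1}(u_{n-m+1}) (Equation (12)).

    U is (n, m) with columns (u_n, ..., u_{n-m+1}), each zero-padded at the top
    and assumed to have unit norm. Also returns the intermediate states of
    Equation (13). Total cost is O(n m).
    """
    v = h
    states = [v]                        # states[k] = h_{t,k}, with states[0] = h_t
    for u in U.T[::-1]:                 # apply H_{n-m+1} first, H_n last
        v = v - 2.0 * u * (u @ v)       # (I - 2 u u^T) v, O(n) per reflection
        states.append(v)
    return v, states
```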
Using an algorithm such as BPTT to calculate the gradients requires the values of the hidden states, including the intermediate ones, in order to propagate the gradients. A naive approach would be to store the values of h_{t,k} for all t ≤ T and 1 ≤ k ≤ m - 1. This increases the memory requirement by a factor of m compared to the simple RNN, which may be acceptable if m is fixed irrespective of the size of the hidden state n; when m = n, however, it is clearly a problem. To circumvent this issue, note that the h_{t,k} for 1 ≤ k ≤ m - 1 do not need to be stored explicitly during forward propagation: they can be recalculated on the fly during BPTT, which only adds O(mn) operations at each time step. Indeed, from Equation (13), the only ingredients needed to regenerate the intermediate hidden states at time step t are the reflection vectors {u_i} and the hidden state h_t. Having to generate the intermediate vectors locally does not affect the overall time complexity of the BPTT algorithm: adding O(mn) operations at each time step still results in an overall complexity

of O(Tmn) for one sequence of length T. However, unless one implements the BPTT algorithm from scratch, it would be challenging to modify a pre-existing library to manage memory in the way needed for our parametrisation.

4.3. Compact WY representation

In order to make the implementation of our method feasible, we need to derive explicit expressions for the following gradients with respect to some scalar loss function L:

∇U = [∂(Wh)/∂U]^T ∇(Wh),   (14)
∇h = [∂(Wh)/∂h]^T ∇(Wh),   (15)

where W is the transition matrix, U = (u_n ... u_{n-m+1}) is the matrix whose columns are the reflection vectors, and ∇X denotes ∂L/∂X. Furthermore, we need to be able to compute these gradients in O(mn) time in order to achieve the global time and space complexities discussed in the previous subsections. Having explicit expressions for ∇U and ∇h allows an implementation that does not require storing any intermediate hidden states for BPTT. Note that in the case of an RNN, ∇U represents the immediate gradient at a given time step t; in that case, the hidden states {h_k}_{k ≥ t} are affected by a small perturbation of U only through the hidden state h_t.

To write the mathematical expressions of the gradients in Equations (14) and (15), we use the compact WY representation (Joffrain et al., 2006) of a product of Householder reflections, described by the following proposition.

Proposition 1. (Joffrain et al., 2006) Let n ∈ N and m ≤ n - 1. For n - m + 1 ≤ i ≤ n, let u_i ∈ R^i be the reflection vectors and let U = (u_n ... u_{n-m+1}) be the matrix whose columns are the {u_i} (padded with leading zeros to length n). Then

H_n(u_n) ... H_{n-m+1}(u_{n-m+1}) = I_n - U T^{-1} U^T,   (16)

where

T = striu(U^T U) + (1/2) diag(U^T U),   (17)

with striu(U^T U) and diag(U^T U) representing the strictly upper triangular part and the diagonal of the matrix U^T U, respectively. This is the compact WY representation.

For the particular case of m = n in Proposition 1, the compact WY representation can be written as

H_n(u_n) ... H_1(u_1) = (I_n - U T^{-1} U^T) H_1(u_1),   (18)

where U = (u_n ... u_2) and H_1(u_1) is defined in Equation (11).

Theorem 2. Let m ≤ n - 1 and W = I_n - U T^{-1} U^T, where U = (u_n ... u_{n-m+1}) and T is defined in Equation (17). Let J ∈ R^{m×m} be the matrix of all ones, let B = striu(J) + (1/2) I_m, and let L be a scalar loss function. We have

∇U = U [ (h̄ C̄^T) ∘ B^T + (C̄ h̄^T) ∘ B ] - ∇C h̄^T - h C̄^T,   (19)
∇h = ∇C - U C̄,   (20)

where C = W h, ∇C = ∂L/∂C, C̄ = T^{-T} U^T ∇C, h̄ = T^{-1} U^T h, and ∘ denotes the Hadamard product. The gradients ∇U and ∇h defined in Equations (19) and (20) can be calculated in O(mn) time and space.

The proof of Equations (19) and (20) is provided in the appendix. Algorithms 2 and 3 show routines which calculate T^{-1} v and T^{-T} v in O(mn) time for any vector v. Algorithm 4 shows a routine which calculates U [ (v w^T) ∘ B^T + (w v^T) ∘ B ] for any vectors v and w, also in O(mn) time. Using these routines, the gradients ∇U and ∇h can be calculated in O(mn) time.

Algorithm 2  Calculating T^{-1} v for any vector v. For a matrix U ∈ R^{n×m}, U_{:,k} denotes its k-th column.
Require: U = (u_n ... u_{n-m+1}) ∈ R^{n×m} and v ∈ R^m.
Ensure: o = T^{-1} v, where o ∈ R^m
  o ← [0, ..., 0]^T ∈ R^m   {initialise o to zeros}
  t ← [0, ..., 0]^T ∈ R^n   {temporary vector}
  for k = m to 1 do
    o_k ← 2 (v_k - U_{:,k}^T t) / ||U_{:,k}||^2
    t ← t + o_k U_{:,k}
  end for

Algorithm 3  Calculating T^{-T} v for any vector v. For a matrix U ∈ R^{n×m}, U_{:,k} denotes its k-th column.
Require: U = (u_n ... u_{n-m+1}) ∈ R^{n×m} and v ∈ R^m.
Ensure: o = T^{-T} v, where o ∈ R^m
  o ← [0, ..., 0]^T ∈ R^m   {initialise o to zeros}
  t ← [0, ..., 0]^T ∈ R^n   {temporary vector}
  for k = 1 to m do
    o_k ← 2 (v_k - U_{:,k}^T t) / ||U_{:,k}||^2
    t ← t + o_k U_{:,k}
  end for
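As a sanity check (ours, not from the paper), the identity of Proposition 1 can be verified numerically with numpy; the dense triangular solve below plays the role that Algorithms 2 and 3 play in O(mn) time in the actual implementation.

```python
import numpy as np

def compact_wy(U):
    """Right-hand side of Equation (16): I - U T^{-1} U^T, with T as in Equation (17)."""
    n = U.shape[0]
    G = U.T @ U
    T = np.triu(G, k=1) + 0.5 * np.diag(np.diag(G))
    return np.eye(n) - U @ np.linalg.solve(T, U.T)

def product_of_reflections(U):
    """Left-hand side of Equation (16), built explicitly; columns ordered (u_n, ..., u_{n-m+1})."""
    n = U.shape[0]
    W = np.eye(n)
    for u in U.T:
        W = W @ (np.eye(n) - 2.0 * np.outer(u, u))
    return W

rng = np.random.default_rng(0)
n, m = 8, 5
U = np.zeros((n, m))
for j, i in enumerate(range(n, n - m, -1)):      # column j holds u_i in R^i, zero-padded
    U[n - i:, j] = rng.normal(size=i)
    U[:, j] /= np.linalg.norm(U[:, j])           # unit-norm reflection vectors
print(np.allclose(product_of_reflections(U), compact_wy(U)))   # True
```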

Algorithm 4  Calculating U [ (v w^T) ∘ B^T + (w v^T) ∘ B ] for any vectors v and w in R^m. For a matrix U ∈ R^{n×m}, U_{:,k} denotes its k-th column.
Require: U = (u_n ... u_{n-m+1}) ∈ R^{n×m} and v, w ∈ R^m.
Ensure: Z = U [ (v w^T) ∘ B^T + (w v^T) ∘ B ] ∈ R^{n×m}
  Z ← 0_{n×m}   {initialise Z with zeros}
  t ← [0, ..., 0]^T ∈ R^n   {temporary vector}
  l ← [0, ..., 0]^T ∈ R^n   {temporary vector}
  for k = 1 to m - 1 do
    t ← t + U_{:,k} v_k
    Z_{:,k+1} ← Z_{:,k+1} + w_{k+1} t
    l ← l + U_{:,m-k+1} w_{m-k+1}
    Z_{:,m-k+1} ← Z_{:,m-k+1} + v_{m-k+1} l
  end for
  l ← l + U_{:,1} w_1
  Z_{:,1} ← Z_{:,1} + v_1 l

5. Experiments

We implemented and tested our method using the Python library Theano (Theano Development Team, 2016). We implemented a tensor operator F that takes a matrix W and a vector h and returns the product W h. In order to enable automatic differentiation, we also implemented a gradient method associated with the operator F, which calculates the gradients in Equations (19) and (20) using Algorithms 2, 3, and 4.

5.1. Scalability

In this first experiment we tested the scalability of our method by measuring the time required for one gradient calculation. The input consisted of one randomly generated two-dimensional sequence, and the target output was a fixed scalar value at the end of the sequence. We compared our method to a naive approach where the orthogonal matrix is generated explicitly before the forward propagation using the compact representation W = I - U T^{-1} U^T. Note that even though the matrix T is triangular and its inverse can be calculated in O(m^2) time, generating W would still require O(nm^2) operations because of the matrix-matrix products involved. Our method, on the other hand, does not require the explicit computation of W. We considered the following situations:

- We fixed the length T of the input sequence and varied the number of hidden units n of the ornn in the set {128, 256, 512, 1024, 2048} (see Figure 1(a)). The number of reflection vectors was equal to the hidden size (i.e. m = n). Our method scales as O(n^2), compared with O(n^3) for the naive approach, as expected. Our method is faster than the latter for n larger than a few hundred.
- We fixed the length of the input sequence and the number of hidden units (n = 2048), and varied the number of reflections m ∈ {128, 256, 512, 1024, 2048} (see Figure 1(b)). Our method scales as O(m).
- The last case consisted of fixing the number of hidden units and reflections to n = m = 128 and varying the length of the input sequence in {64, 128, 256, 512, 1024, 2048, 4096}. The computational time increased linearly with the length of the sequence (see Figure 1(c)).

This first experiment confirms that, using our approach, BPTT can be performed in O(Tmn) time for an input sequence of length T.

5.2. Application to problems with long-term dependencies

We tested the new parametrisation on three different datasets, all having long-term dependencies. We refer to the RNN parametrised using Householder reflections as the ornn. We set its activation function to the leaky Rectified Linear Unit (ReLU), defined as

φ(x) = x  if x > 0,  and  0.1x  otherwise.   (21)

For all experiments, we used the Adam method for stochastic gradient descent. We tested different learning rates in the set {10^-1, 10^-2, 10^-3, 10^-4} for every model and selected the ones which resulted in the fastest convergence on the training sets. We initialised all the parameters using uniform distributions similar to (Arjovsky et al., 2015).
The biases of all models were set to zero, except for the forget bias of the LSTM, which we set to 5 in order to facilitate the learning of long-term dependencies (Koutník et al., 2014). To ensure that the transition matrix of the ornn is always orthogonal, we projected each reflection vector u_i onto the unit sphere after each gradient iteration.

SEQUENCE GENERATION

In this experiment, we followed a setting similar to (Koutník et al., 2014), where we trained RNNs to encode song excerpts. We used the track Manyrista from the album Musica Deposita by Cuprum. We extracted five consecutive excerpts around the beginning of the song, each having 800 data points and corresponding to about 18 ms at a 44.1 kHz sampling frequency. We trained an srnn, an LSTM,

Figure 1. Scalability tests, comparing our optimised approach with the naive one. (a) Time required for one gradient step when the input is one two-dimensional sequence of fixed length T; the size of the hidden layer n takes values in {128, 256, 512, 1024, 2048}. (b) Time required for one gradient step when the size of the hidden layer (n = 2048) and the length of the input sequence are fixed; the number of reflections m takes values in {128, 256, 512, 1024, 2048}. (c) Time required for one gradient step when the size of the hidden layer is fixed to n = 128; the input is one two-dimensional sequence of length T ∈ {64, 128, 256, 512, 1024, 2048, 4096}.

and an ornn for 5 epochs on each of the pieces with five random seeds. For each run, the lowest Normalised Mean Squared Error (NMSE) during the 5 epochs was recorded. For each model, we tested three different hidden sizes; the total number of parameters corresponding to these hidden sizes was approximately equal to 250, 500, and 1000. For the ornn, we chose the number of reflection vectors to be equal to the hidden size in each case, so that the transition matrix is allowed to span the full orthogonal group. The results are shown in Figures 2 and 3. All the learning rates were set to 0.1. The orthogonal parametrisation outperformed the simple RNN and performed on average better than the LSTM on this sequence encoding task.

PIXEL MNIST

We used the MNIST image dataset in this experiment, split into a training set (55,000 instances), a validation set (5,000 instances), and a test set (10,000 instances). We trained an ornn with n = 128 hidden units and m = 16 reflections, which corresponds to approximately 3,600 parameters, to minimise the cross-entropy. We set the batch size to 64 and selected the best model based on the lowest validation cost after 2 epochs; the latter was evaluated every 2 training iterations. All the learning rates were set to 0.1 in this experiment. Table 2 compares our test performance against the urnn results available in the literature. The models used for comparison all have about 16K parameters, which is more than four times the number of parameters of our tested model. Despite this, our model still achieved a 95.6% test accuracy, which is the second best result.

Figure 3. Results summary for the sequence generation task. The plots show the NMSE distribution for the different models with respect to the total number of parameters. The horizontal red lines represent the medians of the NMSE over 25 data points (i.e. five seeds for each of the five song excerpts). The solid rectangles and the dashed bars represent the [25%, 75%] and [9%, 91%] confidence intervals respectively.

Figure 2. The RNN-generated sequences against the true data for two of the five excerpts used. Only the best performing models with a total number of parameters N_p ≈ 1000 are displayed; this corresponds to hidden sizes of 15, 30, and 41 for the LSTM, srnn, and ornn respectively. The number of reflections was set to m = n for the ornn. The srnn showed the worst performance.

Model | hidden size (# reflections) | Number of parameters | Validation accuracy | Test accuracy
ornn | 128 (m = 16) | ≈3.6K | 96.8% | 95.6%
urnn (Arjovsky et al., 2015) | - | ≈16K | - | -
RC urnn (Jia, 2016) | - | ≈16K | - | -
FC urnn (Jia, 2016) | - | ≈16K | - | -

Table 2. Results summary for the MNIST digit classification experiment and comparison with the urnn results available in the literature. FC and RC stand for Full-Capacity and Restricted-Capacity respectively.

ADDING TASK

In this experiment, we follow a setting similar to (Arjovsky et al., 2015), where the goal of the RNN is to output the sum of two elements in the first dimension of a two-dimensional sequence. The locations of the two elements to be summed are specified by the entries in the second dimension of the input sequence. In particular, the first dimension of every input sequence consists of random numbers between 0 and 1. The second dimension is all zeros except for two elements, which are equal to 1; the first unit entry is located in the first half of the sequence, and the second one in the second half. We tested two different lags, T = 400 and T = 800. All models were trained to minimise the Mean Squared Error (MSE). The baseline MSE for this task is 0.167, corresponding to a model that always outputs 1.

We trained an ornn with n = 128 hidden units and m = 16 reflections, as in the MNIST experiment. We also trained an LSTM and an srnn with hidden sizes of 28 and 54 respectively, corresponding to a total of approximately 3,600 parameters (i.e. the same as the ornn model). We chose a batch size of 5, and after each iteration a new set of sequences was generated randomly. The learning rate for the ornn was set to 0.1. Figures 5(a) and 5(b) display the results. The ornn was able to beat the baseline MSE early in training for both lags and for two different random initialisation seeds. This is in line with the results of the unitary RNN (Arjovsky et al., 2015).

6. Discussion

In this work, we presented a new efficient way to parametrise the transition matrix of a recurrent neural network using Householder reflections. This method provides an easy and computationally efficient way to enforce an orthogonal constraint on the transition matrix, which in turn ensures that exploding gradients do not occur during training. We have also shown that enforcing a unitary constraint on the transition matrix is a special case of the orthogonal one. Experimental results show that the orthogonal parametrisation has benefits similar to the unitary one.

Acknowledgment

The authors would like to acknowledge the Department of State Growth Tasmania for funding this work through SenseT.

References

Arjovsky, Martin, Shah, Amar, and Bengio, Yoshua. Unitary evolution recurrent neural networks. arXiv preprint, 2015.

Giles, Mike B. An extended collection of matrix derivative results for forward and reverse mode automatic differentiation. 2008.

Glorot, Xavier and Bengio, Yoshua. Understanding the difficulty of training deep feedforward neural networks. In AISTATS, volume 9, 2010.

Figure 4. Convergence plots for the MNIST classification experiment. (a) Validation accuracy of the ornn with n = 128 and m = 16 as a function of the number of epochs. (b) Cross-entropy on the validation set of the ornn with n = 128 and m = 16 as a function of the number of epochs.

Figure 5. Results of the addition task for lags T = 400 (a) and T = 800 (b), showing the MSE of the LSTM, srnn, and ornn as a function of training iterations. For each lag T, the red curves represent two runs of the ornn model with two different random initialisation seeds. Notice that the LSTM and srnn did not beat the baseline MSE of 0.167.

Henaff, Mikael, Szlam, Arthur, and LeCun, Yann. Orthogonal RNNs and long-memory tasks. arXiv preprint, 2016.

Hochreiter, Sepp and Schmidhuber, Jürgen. Long short-term memory. Neural Computation, 9(8), 1997.

Hochreiter, Sepp, Bengio, Yoshua, Frasconi, Paolo, and Schmidhuber, Jürgen. Gradient flow in recurrent nets: the difficulty of learning long-term dependencies, 2001.

Hyland, Stephanie L. and Rätsch, Gunnar. Learning unitary operators with help from u(n). arXiv preprint, 2016.

Jia, Kui. Improving training of deep neural networks via singular value bounding. arXiv preprint, 2016.

Joffrain, Thierry, Low, Tze Meng, Quintana-Ortí, Enrique S., van de Geijn, Robert, and Van Zee, Field G. Accumulating Householder transformations, revisited. ACM Transactions on Mathematical Software (TOMS), 32(2), 2006.

Koutník, Jan, Greff, Klaus, Gomez, Faustino J., and Schmidhuber, Jürgen. A clockwork RNN. CoRR, 2014.

Le, Quoc V., Jaitly, Navdeep, and Hinton, Geoffrey E. A simple way to initialize recurrent networks of rectified linear units. arXiv preprint, 2015.

Pascanu, Razvan, Mikolov, Tomas, and Bengio, Yoshua. On the difficulty of training recurrent neural networks. In ICML, 2013.

Theano Development Team. Theano: A Python framework for fast computation of mathematical expressions. arXiv e-prints, May 2016.

Wisdom, Scott, Powers, Thomas, Hershey, John R., Le Roux, Jonathan, and Atlas, Les. Full-capacity unitary recurrent neural networks. arXiv preprint, 2016.

A. Proof of Theorem 2

Lemma 1. (Giles, 2008) Let A, B, and C be real or complex matrices such that C = f(A, B), where f is some differentiable mapping. Let s be a scalar quantity that depends on A and B only through C. Then we have the identity

Tr(∇C^T dC) = Tr(∇A^T dA) + Tr(∇B^T dB),

where dA, dB, and dC represent infinitesimal perturbations and ∇X = ∂s/∂X for X ∈ {A, B, C}.

Proof of Theorem 2. Let C = h - U T^{-1} U^T h, where (U, h) ∈ R^{n×m} × R^n and T = striu(U^T U) + (1/2) diag(U^T U). Notice that the matrix T can be written using the Hadamard product as T = B ∘ (U^T U), where B = striu(J) + (1/2) I and J is the matrix of all ones. Calculating the infinitesimal perturbation of C gives

dC = (I - U T^{-1} U^T) dh - dU T^{-1} U^T h - U T^{-1} dU^T h + U T^{-1} dT T^{-1} U^T h.

Using the Hadamard expression for T above, we can write dT = B ∘ (dU^T U + U^T dU). Substituting this back into the expression for dC, multiplying the left- and right-hand sides by ∇C^T, and applying the trace, we get

Tr(∇C^T dC) = Tr(∇C^T (I - U T^{-1} U^T) dh) - Tr(∇C^T dU T^{-1} U^T h) - Tr(∇C^T U T^{-1} dU^T h) + Tr(∇C^T U T^{-1} (B ∘ (dU^T U + U^T dU)) T^{-1} U^T h).

Now using the identity Tr(AB) = Tr(BA), where the second dimension of A agrees with the first dimension of B, we can rearrange the expression of Tr(∇C^T dC) as follows:

Tr(∇C^T dC) = Tr(∇C^T (I - U T^{-1} U^T) dh) - Tr(T^{-1} U^T h ∇C^T dU) - Tr(h ∇C^T U T^{-1} dU^T) + Tr(T^{-1} U^T h ∇C^T U T^{-1} (B ∘ (dU^T U + U^T dU))).

To simplify the expression, we use the short notations C̄ = T^{-T} U^T ∇C and h̄ = T^{-1} U^T h.

With this notation, Tr(∇C^T dC) becomes

Tr(∇C^T dC) = Tr((∇C^T - C̄^T U^T) dh) - Tr(h̄ ∇C^T dU) - Tr(h C̄^T dU^T) + Tr(h̄ C̄^T (B ∘ (dU^T U + U^T dU))).

Now, using the two trace identities

Tr(A^T) = Tr(A),   Tr(A (B ∘ C)) = Tr((A ∘ B^T) C),

we can rewrite Tr(∇C^T dC) as

Tr(∇C^T dC) = Tr((∇C^T - C̄^T U^T) dh) - Tr(h̄ ∇C^T dU) - Tr(h C̄^T dU^T) + Tr(((h̄ C̄^T) ∘ B^T) dU^T U) + Tr(((h̄ C̄^T) ∘ B^T) U^T dU).

By rearranging and taking the transpose of the third and fourth terms of the right-hand side, we obtain

Tr(∇C^T dC) = Tr((∇C^T - C̄^T U^T) dh) - Tr(h̄ ∇C^T dU) - Tr(C̄ h^T dU) + Tr(((C̄ h̄^T) ∘ B) U^T dU) + Tr(((h̄ C̄^T) ∘ B^T) U^T dU).

Factorising by dU inside the trace, we get

Tr(∇C^T dC) = Tr((∇C^T - C̄^T U^T) dh) - Tr([ h̄ ∇C^T + C̄ h^T - ((C̄ h̄^T) ∘ B + (h̄ C̄^T) ∘ B^T) U^T ] dU).

Using Lemma 1, we conclude that

∇U = U [ (h̄ C̄^T) ∘ B^T + (C̄ h̄^T) ∘ B ] - ∇C h̄^T - h C̄^T,
∇h = ∇C - U C̄.
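The reconstructed expressions in Equations (19) and (20) can be checked against finite differences. The following numpy sketch (ours, not part of the paper) assumes the loss is L = g^T (W h) for a fixed vector g, so that ∇C = g; ∇h can be checked the same way.

```python
import numpy as np

def loss(U, h, g):
    """L = g^T C with C = (I - U T^{-1} U^T) h and T as in Equation (17)."""
    G = U.T @ U
    T = np.triu(G, 1) + 0.5 * np.diag(np.diag(G))
    return g @ (h - U @ np.linalg.solve(T, U.T @ h))

def theorem2_grads(U, h, g):
    """Closed-form gradients of Equations (19) and (20), with grad_C = g."""
    m = U.shape[1]
    G = U.T @ U
    T = np.triu(G, 1) + 0.5 * np.diag(np.diag(G))
    B = np.triu(np.ones((m, m)), 1) + 0.5 * np.eye(m)
    h_bar = np.linalg.solve(T, U.T @ h)        # T^{-1} U^T h
    c_bar = np.linalg.solve(T.T, U.T @ g)      # T^{-T} U^T grad_C
    grad_U = (U @ (np.outer(h_bar, c_bar) * B.T + np.outer(c_bar, h_bar) * B)
              - np.outer(g, h_bar) - np.outer(h, c_bar))
    grad_h = g - U @ c_bar
    return grad_U, grad_h

rng = np.random.default_rng(0)
n, m = 7, 4
U, h, g = rng.normal(size=(n, m)), rng.normal(size=n), rng.normal(size=n)
grad_U, grad_h = theorem2_grads(U, h, g)

eps = 1e-6
num_U = np.zeros_like(U)
for i in range(n):
    for j in range(m):
        E = np.zeros_like(U); E[i, j] = eps
        num_U[i, j] = (loss(U + E, h, g) - loss(U - E, h, g)) / (2 * eps)
print(np.max(np.abs(grad_U - num_U)))   # close to zero: agrees with Equation (19)
```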


More information

arxiv: v2 [cs.ne] 7 Apr 2015

arxiv: v2 [cs.ne] 7 Apr 2015 A Simple Way to Initialize Recurrent Networks of Rectified Linear Units arxiv:154.941v2 [cs.ne] 7 Apr 215 Quoc V. Le, Navdeep Jaitly, Geoffrey E. Hinton Google Abstract Learning long term dependencies

More information

Lecture 15: Exploding and Vanishing Gradients

Lecture 15: Exploding and Vanishing Gradients Lecture 15: Exploding and Vanishing Gradients Roger Grosse 1 Introduction Last lecture, we introduced RNNs and saw how to derive the gradients using backprop through time. In principle, this lets us train

More information

arxiv: v1 [stat.ml] 18 Nov 2017

arxiv: v1 [stat.ml] 18 Nov 2017 MinimalRNN: Toward More Interpretable and Trainable Recurrent Neural Networks arxiv:1711.06788v1 [stat.ml] 18 Nov 2017 Minmin Chen Google Mountain view, CA 94043 minminc@google.com Abstract We introduce

More information

Large-Scale Feature Learning with Spike-and-Slab Sparse Coding

Large-Scale Feature Learning with Spike-and-Slab Sparse Coding Large-Scale Feature Learning with Spike-and-Slab Sparse Coding Ian J. Goodfellow, Aaron Courville, Yoshua Bengio ICML 2012 Presented by Xin Yuan January 17, 2013 1 Outline Contributions Spike-and-Slab

More information

SAMPLE OF THE STUDY MATERIAL PART OF CHAPTER 1 Introduction to Linear Algebra

SAMPLE OF THE STUDY MATERIAL PART OF CHAPTER 1 Introduction to Linear Algebra SAMPLE OF THE STUDY MATERIAL PART OF CHAPTER 1 Introduction to 1.1. Introduction Linear algebra is a specific branch of mathematics dealing with the study of vectors, vector spaces with functions that

More information

KRONECKER RECURRENT UNITS

KRONECKER RECURRENT UNITS KRONECKER RECURRENT UNITS Cijo Jose Idiap Research Institute & EPFL cijo.jose@idiap.ch Moustapha Cissé Facebook AI Research moustaphacisse@fb.com François Fleuret Idiap Research Institute & EPFL francois.fleuret@idiap.ch

More information

Deep Learning book, by Ian Goodfellow, Yoshua Bengio and Aaron Courville

Deep Learning book, by Ian Goodfellow, Yoshua Bengio and Aaron Courville Deep Learning book, by Ian Goodfellow, Yoshua Bengio and Aaron Courville Chapter 6 :Deep Feedforward Networks Benoit Massé Dionyssos Kounades-Bastian Benoit Massé, Dionyssos Kounades-Bastian Deep Feedforward

More information

Feedforward Neural Networks. Michael Collins, Columbia University

Feedforward Neural Networks. Michael Collins, Columbia University Feedforward Neural Networks Michael Collins, Columbia University Recap: Log-linear Models A log-linear model takes the following form: p(y x; v) = exp (v f(x, y)) y Y exp (v f(x, y )) f(x, y) is the representation

More information

On the use of Long-Short Term Memory neural networks for time series prediction

On the use of Long-Short Term Memory neural networks for time series prediction On the use of Long-Short Term Memory neural networks for time series prediction Pilar Gómez-Gil National Institute of Astrophysics, Optics and Electronics ccc.inaoep.mx/~pgomez In collaboration with: J.

More information

Lecture 5: Logistic Regression. Neural Networks

Lecture 5: Logistic Regression. Neural Networks Lecture 5: Logistic Regression. Neural Networks Logistic regression Comparison with generative models Feed-forward neural networks Backpropagation Tricks for training neural networks COMP-652, Lecture

More information

ECE521 Lecture 7/8. Logistic Regression

ECE521 Lecture 7/8. Logistic Regression ECE521 Lecture 7/8 Logistic Regression Outline Logistic regression (Continue) A single neuron Learning neural networks Multi-class classification 2 Logistic regression The output of a logistic regression

More information

Sajid Anwar, Kyuyeon Hwang and Wonyong Sung

Sajid Anwar, Kyuyeon Hwang and Wonyong Sung Sajid Anwar, Kyuyeon Hwang and Wonyong Sung Department of Electrical and Computer Engineering Seoul National University Seoul, 08826 Korea Email: sajid@dsp.snu.ac.kr, khwang@dsp.snu.ac.kr, wysung@snu.ac.kr

More information

Learning Deep Architectures for AI. Part II - Vijay Chakilam

Learning Deep Architectures for AI. Part II - Vijay Chakilam Learning Deep Architectures for AI - Yoshua Bengio Part II - Vijay Chakilam Limitations of Perceptron x1 W, b 0,1 1,1 y x2 weight plane output =1 output =0 There is no value for W and b such that the model

More information

arxiv: v1 [cs.lg] 11 May 2015

arxiv: v1 [cs.lg] 11 May 2015 Improving neural networks with bunches of neurons modeled by Kumaraswamy units: Preliminary study Jakub M. Tomczak JAKUB.TOMCZAK@PWR.EDU.PL Wrocław University of Technology, wybrzeże Wyspiańskiego 7, 5-37,

More information

Cheng Soon Ong & Christian Walder. Canberra February June 2018

Cheng Soon Ong & Christian Walder. Canberra February June 2018 Cheng Soon Ong & Christian Walder Research Group and College of Engineering and Computer Science Canberra February June 2018 Outlines Overview Introduction Linear Algebra Probability Linear Regression

More information

CSC321 Lecture 10 Training RNNs

CSC321 Lecture 10 Training RNNs CSC321 Lecture 10 Training RNNs Roger Grosse and Nitish Srivastava February 23, 2015 Roger Grosse and Nitish Srivastava CSC321 Lecture 10 Training RNNs February 23, 2015 1 / 18 Overview Last time, we saw

More information

Eve: A Gradient Based Optimization Method with Locally and Globally Adaptive Learning Rates

Eve: A Gradient Based Optimization Method with Locally and Globally Adaptive Learning Rates Eve: A Gradient Based Optimization Method with Locally and Globally Adaptive Learning Rates Hiroaki Hayashi 1,* Jayanth Koushik 1,* Graham Neubig 1 arxiv:1611.01505v3 [cs.lg] 11 Jun 2018 Abstract Adaptive

More information

Lecture 7 Artificial neural networks: Supervised learning

Lecture 7 Artificial neural networks: Supervised learning Lecture 7 Artificial neural networks: Supervised learning Introduction, or how the brain works The neuron as a simple computing element The perceptron Multilayer neural networks Accelerated learning in

More information

Neural Networks with Applications to Vision and Language. Feedforward Networks. Marco Kuhlmann

Neural Networks with Applications to Vision and Language. Feedforward Networks. Marco Kuhlmann Neural Networks with Applications to Vision and Language Feedforward Networks Marco Kuhlmann Feedforward networks Linear separability x 2 x 2 0 1 0 1 0 0 x 1 1 0 x 1 linearly separable not linearly separable

More information

SAMPLE OF THE STUDY MATERIAL PART OF CHAPTER 1 Introduction to Linear Algebra

SAMPLE OF THE STUDY MATERIAL PART OF CHAPTER 1 Introduction to Linear Algebra 1.1. Introduction SAMPLE OF THE STUDY MATERIAL PART OF CHAPTER 1 Introduction to Linear algebra is a specific branch of mathematics dealing with the study of vectors, vector spaces with functions that

More information

ECE521 Lectures 9 Fully Connected Neural Networks

ECE521 Lectures 9 Fully Connected Neural Networks ECE521 Lectures 9 Fully Connected Neural Networks Outline Multi-class classification Learning multi-layer neural networks 2 Measuring distance in probability space We learnt that the squared L2 distance

More information

Lecture 6. Notes on Linear Algebra. Perceptron

Lecture 6. Notes on Linear Algebra. Perceptron Lecture 6. Notes on Linear Algebra. Perceptron COMP90051 Statistical Machine Learning Semester 2, 2017 Lecturer: Andrey Kan Copyright: University of Melbourne This lecture Notes on linear algebra Vectors

More information

arxiv: v1 [cs.lg] 2 Feb 2018

arxiv: v1 [cs.lg] 2 Feb 2018 Short-term Memory of Deep RNN Claudio Gallicchio arxiv:1802.00748v1 [cs.lg] 2 Feb 2018 Department of Computer Science, University of Pisa Largo Bruno Pontecorvo 3-56127 Pisa, Italy Abstract. The extension

More information

Feed-forward Network Functions

Feed-forward Network Functions Feed-forward Network Functions Sargur Srihari Topics 1. Extension of linear models 2. Feed-forward Network Functions 3. Weight-space symmetries 2 Recap of Linear Models Linear Models for Regression, Classification

More information

Recurrent Residual Learning for Sequence Classification

Recurrent Residual Learning for Sequence Classification Recurrent Residual Learning for Sequence Classification Yiren Wang University of Illinois at Urbana-Champaign yiren@illinois.edu Fei ian Microsoft Research fetia@microsoft.com Abstract In this paper, we

More information

Mathematics. EC / EE / IN / ME / CE. for

Mathematics.   EC / EE / IN / ME / CE. for Mathematics for EC / EE / IN / ME / CE By www.thegateacademy.com Syllabus Syllabus for Mathematics Linear Algebra: Matrix Algebra, Systems of Linear Equations, Eigenvalues and Eigenvectors. Probability

More information

Implicitly and Explicitly Constrained Optimization Problems for Training of Recurrent Neural Networks

Implicitly and Explicitly Constrained Optimization Problems for Training of Recurrent Neural Networks Implicitly and Explicitly Constrained Optimization Problems for Training of Recurrent Neural Networks Carl-Johan Thore Linköping University - Division of Mechanics 581 83 Linköping - Sweden Abstract. Training

More information

Deep Linear Networks with Arbitrary Loss: All Local Minima Are Global

Deep Linear Networks with Arbitrary Loss: All Local Minima Are Global homas Laurent * 1 James H. von Brecht * 2 Abstract We consider deep linear networks with arbitrary convex differentiable loss. We provide a short and elementary proof of the fact that all local minima

More information

Mark Gales October y (x) x 1. x 2 y (x) Inputs. Outputs. x d. y (x) Second Output layer layer. layer.

Mark Gales October y (x) x 1. x 2 y (x) Inputs. Outputs. x d. y (x) Second Output layer layer. layer. University of Cambridge Engineering Part IIB & EIST Part II Paper I0: Advanced Pattern Processing Handouts 4 & 5: Multi-Layer Perceptron: Introduction and Training x y (x) Inputs x 2 y (x) 2 Outputs x

More information

NEURAL LANGUAGE MODELS

NEURAL LANGUAGE MODELS COMP90042 LECTURE 14 NEURAL LANGUAGE MODELS LANGUAGE MODELS Assign a probability to a sequence of words Framed as sliding a window over the sentence, predicting each word from finite context to left E.g.,

More information

CS 6501: Deep Learning for Computer Graphics. Basics of Neural Networks. Connelly Barnes

CS 6501: Deep Learning for Computer Graphics. Basics of Neural Networks. Connelly Barnes CS 6501: Deep Learning for Computer Graphics Basics of Neural Networks Connelly Barnes Overview Simple neural networks Perceptron Feedforward neural networks Multilayer perceptron and properties Autoencoders

More information

Lecture 17: Neural Networks and Deep Learning

Lecture 17: Neural Networks and Deep Learning UVA CS 6316 / CS 4501-004 Machine Learning Fall 2016 Lecture 17: Neural Networks and Deep Learning Jack Lanchantin Dr. Yanjun Qi 1 Neurons 1-Layer Neural Network Multi-layer Neural Network Loss Functions

More information

STA 414/2104: Lecture 8

STA 414/2104: Lecture 8 STA 414/2104: Lecture 8 6-7 March 2017: Continuous Latent Variable Models, Neural networks Delivered by Mark Ebden With thanks to Russ Salakhutdinov, Jimmy Ba and others Outline Continuous latent variable

More information

Neural Networks 2. 2 Receptive fields and dealing with image inputs

Neural Networks 2. 2 Receptive fields and dealing with image inputs CS 446 Machine Learning Fall 2016 Oct 04, 2016 Neural Networks 2 Professor: Dan Roth Scribe: C. Cheng, C. Cervantes Overview Convolutional Neural Networks Recurrent Neural Networks 1 Introduction There

More information

Convolutional LSTM Network: A Machine Learning Approach for Precipitation Nowcasting 卷积 LSTM 网络 : 利用机器学习预测短期降雨 施行健 香港科技大学 VALSE 2016/03/23

Convolutional LSTM Network: A Machine Learning Approach for Precipitation Nowcasting 卷积 LSTM 网络 : 利用机器学习预测短期降雨 施行健 香港科技大学 VALSE 2016/03/23 Convolutional LSTM Network: A Machine Learning Approach for Precipitation Nowcasting 卷积 LSTM 网络 : 利用机器学习预测短期降雨 施行健 香港科技大学 VALSE 2016/03/23 Content Quick Review of Recurrent Neural Network Introduction

More information

Equivalence Results between Feedforward and Recurrent Neural Networks for Sequences

Equivalence Results between Feedforward and Recurrent Neural Networks for Sequences Equivalence Results between Feedforward and Recurrent Neural Networks for Sequences Alessandro Sperduti Department of Mathematics University of Padova, Italy Email: sperduti@mathunipdit Abstract In the

More information

Accelerating SVRG via second-order information

Accelerating SVRG via second-order information Accelerating via second-order information Ritesh Kolte Department of Electrical Engineering rkolte@stanford.edu Murat Erdogdu Department of Statistics erdogdu@stanford.edu Ayfer Özgür Department of Electrical

More information

Deep Belief Networks are compact universal approximators

Deep Belief Networks are compact universal approximators 1 Deep Belief Networks are compact universal approximators Nicolas Le Roux 1, Yoshua Bengio 2 1 Microsoft Research Cambridge 2 University of Montreal Keywords: Deep Belief Networks, Universal Approximation

More information

Neural Networks Learning the network: Backprop , Fall 2018 Lecture 4

Neural Networks Learning the network: Backprop , Fall 2018 Lecture 4 Neural Networks Learning the network: Backprop 11-785, Fall 2018 Lecture 4 1 Recap: The MLP can represent any function The MLP can be constructed to represent anything But how do we construct it? 2 Recap:

More information