Efficient Orthogonal Parametrisation of Recurrent Neural Networks Using Householder Reflections
arXiv preprint, v1 [cs.LG], 1 Dec 2016

Zakaria Mhammedi (ZMHAMMEDI@STUDENT.UNIMELB.EDU.AU), Department of Computing and Information Systems, The University of Melbourne, Parkville VIC 3010, Australia
Andrew Hellicar (ANDREW.HELLICAR@DATA61.CSIRO.AU), CSIRO Data61, Sandy Bay, TAS, Australia
Ashfaqur Rahman (ASHFAQUR.RAHMAN@DATA61.CSIRO.AU), CSIRO Data61, Sandy Bay, TAS, Australia
James Bailey (JAMES.BAILEY@UNIMELB.EDU.AU), Department of Computing and Information Systems, The University of Melbourne, Parkville VIC 3010, Australia

Abstract

Recurrent Neural Networks (RNNs) have been successfully used in many applications. However, the problem of learning long-term dependencies in sequences using these networks is still a major challenge. Recent methods have been suggested to solve this problem by constraining the transition matrix to be unitary during training, which ensures that its norm is exactly equal to one. These methods either have limited expressiveness or scale poorly with the size of the network when compared to the simple RNN case, especially in an online learning setting. Our contributions are as follows. We first show that constraining the transition matrix to be unitary is a special case of an orthogonal constraint, so it may not be necessary to work with complex-valued matrices. We then present a new parametrisation of the transition matrix which allows efficient training of an RNN while ensuring that the transition matrix is always orthogonal. Using our approach, one online gradient step can, in the worst case, be performed in time O(T n^2), where T and n are the length of the input sequence and the size of the hidden layer respectively. This is the same time complexity as for the simple RNN. Finally, we test our new parametrisation on problems with long-term dependencies. Our results suggest that the orthogonal constraint on the transition matrix has benefits similar to those of the unitary constraint.

1. Introduction

Recurrent Neural Networks (RNNs) have been successfully used in many applications involving time series. This is because RNNs are well suited for sequential data: they process inputs one element at a time and store relevant information in their hidden state. In practice, however, training simple RNNs can be challenging due to the problem of exploding and vanishing gradients (Hochreiter et al., 2001). It has been shown that the exploding gradient problem is likely to occur when the transition matrix of an RNN has a spectral norm larger than one (Glorot & Bengio, 2010). This results in an error surface, associated with some objective function, that has very steep walls (Pascanu et al., 2013); landing on one of these walls during training produces a very large gradient step, which can disrupt training. On the other hand, when the spectral norm of the transition matrix is less than one, the information at one time step tends to vanish after a few time steps, which makes it challenging to learn long-term dependencies in sequential data.

Different methods have been suggested to solve either the vanishing or the exploding gradient problem. The LSTM was specifically designed to help with the vanishing gradient (Hochreiter & Schmidhuber, 1997). This is achieved by using gate vectors which allow a linear flow of information through the hidden state. However, the LSTM does not directly address the exploding gradient problem. In fact, the longer the input sequence is, the more prone the LSTM is to exploding gradients. One approach to this issue is to clip the gradients (Pascanu et al., 2013) when their norm exceeds some threshold value, which has the negative side effect of adding an additional hyperparameter to the model. Furthermore, if exploding gradients can occur within some region of the parameter search space, the associated error surface still has steep walls, making training challenging even with gradient clipping.

Another way to approach this problem is to improve the shape of the error surface directly by making it smoother, which can be achieved by constraining the spectral norm of the transition matrix to be less than or equal to one. However, a value of exactly one is best for the vanishing gradient problem. A good choice of the activation function between hidden states is also crucial in this case. These ideas have been investigated in recent work. In particular, the unitary RNN (Arjovsky et al., 2015) constrains the transition matrix to lie in the unitary group U(n) using a special parametrisation, which ensures that the norm of the transition matrix is exactly equal to one. This parametrisation and other similar ones (Hyland & Rätsch, 2016; Wisdom et al., 2016) have some advantages and drawbacks which we discuss in more detail in the next section.

The main contributions of this work are as follows. We first show that constraining the search space of the transition matrix of an RNN to the unitary group U(n) is equivalent to limiting the search space to a subset of the orthogonal group O(2n) of a new RNN with twice the hidden size. This suggests that it may not be necessary to work with complex-valued matrices. We then present a simple and efficient way to parametrise orthogonal transition matrices.
We show that the time and space complexities of using this parametrisation are, in the worst case, the same as those of the simple RNN. This new parametrisation can also be used in other types of neural networks, not just recurrent ones.

2. Related Work

Throughout this work we will refer to elements of the following simple RNN architecture:

h_t = φ(W h_{t-1} + V x_t),  (1)
o_t = Y h_t,  (2)

where W, V, and Y are the hidden-to-hidden, input-to-hidden, and hidden-to-output weight matrices, h_{t-1} and h_t are the hidden vectors at time steps t-1 and t respectively, and φ is a non-linear activation function. We have omitted the bias terms for simplicity.

Concerning the structure of the transition matrix W, recent research has explored how its initialisation influences training and the ability to learn long-term dependencies. In particular, initialising with specific structures such as the identity or an orthogonal matrix can greatly improve performance (Le et al., 2015). In addition to these initialisation methods, one study has also considered removing the non-linearity between the hidden-to-hidden connections (Henaff et al., 2016), i.e. taking the term W h_{t-1} in Equation (1) outside the activation function φ. This method showed good results compared to the LSTM on synthetic problems exhibiting pathological long-term dependencies.

After training a model for a few iterations using gradient descent, however, nothing guarantees that the initial structure of the transition matrix will be preserved. In fact, its spectral norm can deviate from one, and the exploding and vanishing gradient problems can reappear. It is possible to constrain the transition matrix to be orthogonal during training using more sophisticated methods (Wisdom et al., 2016), but this comes at a cost in time complexity. A naive approach is to use gradient descent and then project the updated transition matrix onto the orthogonal group, using the QR decomposition for example.
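The naive projection step just described can be sketched in a few lines of NumPy (an illustrative snippet, not code from the paper; the sign fix on the diagonal of R makes the factorisation, and hence the projection, deterministic):

```python
import numpy as np

def project_to_orthogonal(W):
    """Project a square matrix onto the orthogonal group via QR.

    Cost: O(n^3) per call, which is the bottleneck the paper's
    parametrisation avoids.
    """
    Q, R = np.linalg.qr(W)
    # Flip column signs so that R has a positive diagonal.
    Q = Q * np.sign(np.diag(R))
    return Q

rng = np.random.default_rng(0)
W = rng.standard_normal((4, 4))          # a transition matrix that has drifted
W_proj = project_to_orthogonal(W)
assert np.allclose(W_proj.T @ W_proj, np.eye(4))   # orthogonal again
```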
This incurs an O(n^3) cost at each iteration, where n is the size of the hidden layer, which can be an issue when the batch size and the sequence length T are small compared to n. Recently, new parametrisations of the transition matrix have been suggested (Arjovsky et al., 2015) which ensure that its spectral norm is always equal to one. This condition is sufficient to ensure that exploding gradients do not occur, and makes methods such as gradient clipping unnecessary. The unitary RNN (uRNN) is one example, where the transition matrix W ∈ C^{n×n} is a product of elementary matrices consisting of reflection, diagonal, and Fourier transform matrices. When the size of the hidden layer is n, the transition matrix has a total of only 7n parameters. Another advantage of this parametrisation is computational efficiency: the matrix-vector product W v, for some vector v, can be calculated in O(n log n) time. However, it has been shown that this parametrisation does not allow the transition matrix to span the full unitary group (Wisdom et al., 2016), which may limit the expressiveness of the model.

Another interesting parametrisation (Hyland & Rätsch, 2016) takes advantage of the algebraic properties of the unitary group U(n). The idea is to use the corresponding matrix Lie algebra u(n) of skew-Hermitian matrices. In particular, the transition matrix can
be written as follows:

W = exp( Σ_{i=1}^{n^2} λ_i T_i ),  (3)

where exp is the matrix exponential map and {T_i}_{i=1}^{n^2} are predefined n × n matrices forming a basis of the Lie algebra u(n). The learning parameters are the weights {λ_i}. The fact that the matrix Lie algebra u(n) is closed and connected ensures that the exponential map from u(n) to U(n) is surjective. Therefore, with this parametrisation the search space of the transition matrix spans the whole unitary group. This is one advantage over the original unitary parametrisation (Arjovsky et al., 2015). However, the cost of computing the matrix exponential in Equation (3) is O(n^3), where n is the dimension of the hidden state, and evaluating the sum Σ_{i=1}^{n^2} λ_i T_i can also be computationally expensive.

A more recent method (Wisdom et al., 2016) performs optimisation directly on the Stiefel manifold; the corresponding model was called the full-capacity unitary RNN. Using this approach, the transition matrix can span the full unitary group. However, the method involves a matrix inverse as well as matrix-matrix products, which have O(n^3) time complexity. This can be problematic when the size of the hidden layer is large, and the method is somewhat similar to the projection approach discussed earlier.

All the methods discussed above, except for the original unitary RNN, involve a step requiring at least O(n^3) time, and all of them require the use of complex matrices. Table 1 summarises the time complexities of these methods, including our approach, for one online gradient step. In the next section, we show that imposing a unitary constraint on a transition matrix W ∈ C^{n×n} is equivalent to imposing a constraint in a subset of the orthogonal group O(2n) on a new RNN with twice the hidden size. Furthermore, since the norm of an orthogonal matrix is also always equal to one, using the latter has the same theoretical advantages as using unitary matrices when it comes to the exploding gradient problem.

3. Complex unitary versus orthogonal

It is possible to show that when the transition matrix W ∈ C^{n×n} of an RNN is unitary, there exists an equivalent representation of this RNN with an orthogonal transition matrix Ŵ ∈ R^{2n×2n}. In fact, consider a complex unitary transition matrix W = A + iB ∈ C^{n×n}, where A and B are real-valued matrices in R^{n×n}, together with the following new variables:

ĥ_t = [ Re(h_t) ; Im(h_t) ],   V̂ = [ Re(V) ; Im(V) ],   Ŵ = [ A  -B ; B  A ],

where Re and Im denote the real and imaginary parts of a complex number. Note that ĥ_t ∈ R^{2n}, Ŵ ∈ R^{2n×2n}, and V̂ ∈ R^{2n×n_x}, where n_x is the dimension of the input vector x_t in Equation (1). Assuming that the activation function φ applies to the real and imaginary parts separately, it is easy to show that the update equation of the complex hidden state h_t of the unitary RNN can be written in the real space as

ĥ_t = φ(Ŵ ĥ_{t-1} + V̂ x_t).  (4)

Even when the activation function φ does not apply to the real and imaginary parts separately, it is still possible to find an equivalent real-space representation. Take for example the activation function proposed by Arjovsky et al. (2015):

σ_modReLU(z) = (|z| + b) z / |z|  if |z| + b > 0,  and 0 otherwise,  (5)

where b is a bias vector. For a hidden state ĥ ∈ R^{2n}, the equivalent activation function in the real-space representation (4) is given by

φ(ĥ_i) = [ (sqrt(ĥ_i^2 + ĥ_{k_i}^2) + b_{k'_i}) / sqrt(ĥ_i^2 + ĥ_{k_i}^2) ] ĥ_i  if sqrt(ĥ_i^2 + ĥ_{k_i}^2) + b_{k'_i} > 0,  and 0 otherwise,  (6)

where k_i = (i + n) mod 2n and k'_i = i mod n for all i ∈ {1, ..., 2n}. This activation is no longer applied to the hidden units independently.

Now we show that the matrix Ŵ is orthogonal. By definition of a unitary matrix, we have W W* = I, where * denotes the conjugate transpose. This implies

A Aᵀ + B Bᵀ = I,  (7)
B Aᵀ - A Bᵀ = 0.  (8)

On the other hand, we have

Ŵ Ŵᵀ = [ A Aᵀ + B Bᵀ   A Bᵀ - B Aᵀ ; B Aᵀ - A Bᵀ   B Bᵀ + A Aᵀ ].  (9)

Using Equations (7) and (8), it follows that Ŵ Ŵᵀ = I. Note also that the matrix Ŵ has a special structure.
It is a block matrix whose blocks are the matrices A, -B, B, and A. The discussion above shows that using a complex unitary matrix in C^{n×n} is equivalent to using an orthogonal matrix, belonging to a subset of O(2n), in a new RNN with twice the hidden size. This is why, in this work, we focus on parametrising orthogonal matrices directly.
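The equivalence above is easy to check numerically. The following NumPy sketch (illustrative, not code from the paper) builds a random unitary W = A + iB, forms Ŵ, and verifies that Ŵ is orthogonal and acts on the stacked real and imaginary parts exactly as W acts on h:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 3
# A random unitary W = A + iB, obtained via QR of a complex Gaussian matrix.
Z = rng.standard_normal((n, n)) + 1j * rng.standard_normal((n, n))
W, _ = np.linalg.qr(Z)
A, B = W.real, W.imag

# The real 2n x 2n representation W_hat = [[A, -B], [B, A]].
W_hat = np.block([[A, -B], [B, A]])

# W unitary implies W_hat orthogonal (Equations (7)-(9)).
assert np.allclose(W.conj().T @ W, np.eye(n))
assert np.allclose(W_hat.T @ W_hat, np.eye(2 * n))

# The stacked hidden state [Re(h); Im(h)] transforms consistently:
h = rng.standard_normal(n) + 1j * rng.standard_normal(n)
h_hat = np.concatenate([h.real, h.imag])
assert np.allclose(W_hat @ h_hat,
                   np.concatenate([(W @ h).real, (W @ h).imag]))
```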
Table 1. Time complexities associated with one gradient step for different methods, when the size of the hidden layer of an RNN is n and the input is one sequence of length T.

Method | Constraint on the transition matrix | Time complexity of one online gradient step | Search space of the transition matrix
uRNN (Arjovsky et al., 2015) | ‖W‖ = 1 | O(T n log n) | A subset of U(n)
Full-capacity uRNN (Wisdom et al., 2016) | ‖W‖ = 1 | O(T n^2 + n^3) | The full U(n) group
uRNN (Hyland & Rätsch, 2016) | ‖W‖ = 1 | O(T n^2 + n^3) | The full U(n) group
oRNN (our approach) | ‖W‖ = 1 | O(T n m), where m ≤ n | The full O(n) group when m = n

In the next section, we present an efficient and flexible parametrisation of the transition matrix which allows it to span the full orthogonal group.

4. Parametrisation of the transition matrix

Before discussing the details of our parametrisation, we introduce some notation. For n ∈ N and 2 ≤ k ≤ n, we define the mappings

H_k : R^k → R^{n×n},   u ↦ [ I_{n-k}  0 ; 0  I_k - 2 u uᵀ ],  (10)

where I_k denotes the k-dimensional identity matrix. Note that when ‖u‖ = 1, H_k(u) is simply a Householder reflection in R^{n×n}, which is orthogonal. For the special case k = 1, we define H_1 as

H_1 : R → R^{n×n},   u ↦ [ I_{n-1}  0 ; 0  u ],  (11)

where u is simply a scalar in this case. Note that when u ∈ {-1, 1}, H_1(u) is orthogonal.

We propose to parametrise the transition matrix of an RNN using the vectors {u_i}, which we will refer to as the reflection vectors in what follows. This parametrisation has many advantages, as the following theorem will help us show.

Theorem 1. For n ∈ N and 1 ≤ k ≤ n, define the mapping M_k as follows:

M_k : R^k × ⋯ × R^n → R^{n×n},   (u_k, ..., u_n) ↦ H_n(u_n) ⋯ H_k(u_k),

where the functions {H_k}_{k=1}^n are defined in Equations (10) and (11). Then:
(i) The mappings M_k are smooth for all 1 ≤ k ≤ n.
(ii) The image of M_1 includes the orthogonal group O(n), i.e. O(n) ⊆ M_1[R × R^2 × ⋯ × R^n].

Sketch of the proof. The first point of Theorem 1 is straightforward.
Each mapping M_k is the product of n - k + 1 smooth maps and is therefore itself smooth. It remains to show that for every Q ∈ O(n) there exists a tuple of vectors (u_1, ..., u_n) ∈ R × ⋯ × R^n such that Q = M_1(u_1, ..., u_n). Algorithm 1 shows how a QR decomposition can be performed using the matrices {H_k(u_k)}_{k=1}^n while ensuring that the upper triangular matrix R has positive diagonal elements. If we apply this algorithm to an orthogonal matrix Q, we obtain a tuple (u_1, ..., u_n) satisfying H_n(u_n) ⋯ H_1(u_1) R = Q. The matrix R must be orthogonal, being a product of orthogonal matrices, and the only upper triangular orthogonal matrix with positive diagonal elements is the identity. Hence R = I, and we have M_1(u_1, ..., u_n) = H_n(u_n) ⋯ H_1(u_1) = Q.

Algorithm 1. QR decomposition using the mappings {H_k}_{k=1}^n. For a matrix R ∈ R^{n×n}, R_{k,k} denotes its k-th diagonal element, and R_{k..n,k} ∈ R^{n-k+1} is the part of the k-th column from the k-th row down (including the element in the k-th row).
Require: A ∈ R^{n×n}, a full-rank matrix.
Ensure: Q and R such that Q = H_n(u_n) ⋯ H_1(u_1), R is upper triangular with positive diagonal elements, and A = QR.
  R ← A
  Q ← I  {initialise Q to the identity matrix}
  for k = 1 to n - 1 do
    if R_{k,k} == ‖R_{k..n,k}‖ then
      u_{n-k+1} ← [0, ..., 0, 1]ᵀ ∈ R^{n-k+1}
    else
      u_{n-k+1} ← R_{k..n,k} - ‖R_{k..n,k}‖ [1, 0, ..., 0]ᵀ
      u_{n-k+1} ← u_{n-k+1} / ‖u_{n-k+1}‖
    end if
    R ← H_{n-k+1}(u_{n-k+1}) R
    Q ← Q H_{n-k+1}(u_{n-k+1})
  end for
  u_1 ← sgn(R_{n,n})
  R ← H_1(u_1) R
  Q ← Q H_1(u_1)

When using m reflection vectors, the parametrisation can be expressed as

W = M_{n-m+1}(u_{n-m+1}, ..., u_n) = H_n(u_n) ⋯ H_{n-m+1}(u_{n-m+1}).  (12)

This parametrisation has the following advantages. The parametrisation is smooth, which is convenient for training with gradient descent. When m = n, the transition matrix can span the whole orthogonal group O(n). The only requirement for the matrix W to be orthogonal is that the vectors {u_i} have norm one; to ensure that this is always the case, one can project the vectors back onto the unit sphere after each gradient step, which takes O(nm) time. The time and space complexities involved in one gradient step are, in the worst case, the same as those of a simple RNN with the same number of hidden units. This is mainly because the transition matrix W never needs to be generated explicitly: only the matrix-vector products W v are needed, and these are cheaper to compute using our parametrisation. Finally, the parametrisation is flexible: a good trade-off between expressiveness and speed can be found by selecting the number of reflection vectors. When the goal is to ensure that the transition matrix can span the whole orthogonal group, the number of reflection vectors must equal the number of hidden units. Even in this case, the number of redundant parameters associated with W is small: when m = n, the total number of parameters needed for W is n(n + 1)/2.
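Algorithm 1 and the mappings H_k can be sketched in NumPy as follows. This is an illustrative dense-matrix implementation, used only to check the surjectivity claim of Theorem 1 on a random orthogonal Q; the actual method never forms these matrices explicitly:

```python
import numpy as np

def H(u, n):
    """H_k(u) from Equations (10)-(11): identity on the first n-k
    coordinates, the block I_k - 2 u u^T on the last k coordinates
    (or the scalar u itself when k = 1)."""
    k = len(u)
    out = np.eye(n)
    if k == 1:
        out[-1, -1] = u[0]
    else:
        out[n - k:, n - k:] -= 2.0 * np.outer(u, u)
    return out

def reflection_vectors(Q):
    """Algorithm 1 applied to an orthogonal Q (so R ends up being I):
    recover (u_1, ..., u_n) with H_n(u_n) ... H_1(u_1) = Q."""
    n = Q.shape[0]
    R = Q.copy()
    us = [None] * n                          # us[i] will hold u_{i+1}
    for k in range(1, n):                    # k = 1, ..., n-1
        col = R[k - 1:, k - 1]
        norm = np.linalg.norm(col)
        if np.isclose(R[k - 1, k - 1], norm):
            u = np.zeros(n - k + 1)
            u[-1] = 1.0
        else:
            u = col - norm * np.eye(n - k + 1)[0]
            u /= np.linalg.norm(u)
        us[n - k] = u                        # this is u_{n-k+1}
        R = H(u, n) @ R
    us[0] = np.array([np.sign(R[-1, -1])])   # u_1 fixes the last sign
    return us

rng = np.random.default_rng(0)
n = 5
Q = np.linalg.qr(rng.standard_normal((n, n)))[0]
us = reflection_vectors(Q)
M = np.eye(n)
for u in us:              # builds H_n(u_n) ... H_1(u_1), rightmost first
    M = H(u, n) @ M
assert np.allclose(M, Q)  # M_1(u_1, ..., u_n) reproduces Q
```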
There are only n redundant parameters in this case, since the orthogonal group O(n) is an n(n - 1)/2-dimensional manifold. Note that H_1(u_1) is not a standard Householder reflection; the second point of Theorem 1 (i.e. the surjectivity onto O(n)) would not hold had we used a Householder reflection instead.

Time complexity

When using the parametrisation in (12), we do not need to calculate the matrix W explicitly; we only need the matrix-vector products W v. The cost of calculating W v is O(nm), where m and n are the number of reflection vectors and the size of the hidden state respectively. Consequently, given a sequence of length T, the time complexity of Forward Propagation (FP) is O(T mn). Back Propagation Through Time (BPTT) can be performed within the same time complexity as FP. Note that for a simple RNN, the complexities of FP and BPTT are both O(T n^2) for one sequence of length T. Therefore, training an RNN using our parametrisation incurs, in the worst case, the same time complexity as the simple RNN, the worst case being when the number of reflection vectors equals the size of the hidden state.

Space complexity

Parametrising the transition matrix of an RNN using (12) with m reflections is equivalent to adding m - 1 intermediate hidden states between any two consecutive time steps. For time step t, these vectors are

h_{t,k} = H_{n-m+k}(u_{n-m+k}) ⋯ H_{n-m+1}(u_{n-m+1}) h_t,  (13)

for 1 ≤ k ≤ m - 1, with the conventions h_{t,m} = h_{t+1,0} = h_{t+1}. Using an algorithm such as BPTT to calculate the gradients requires the values of the hidden states, including the intermediate ones, in order to propagate the gradients. A naive approach would be to store the values of h_{t,k} for all t ≤ T and 1 ≤ k ≤ m - 1. This would increase memory storage by a factor of m compared to the simple RNN case, which may be acceptable if m is fixed irrespective of the size of the hidden state n. However, when m = n, it is clearly a problem.
To circumvent this issue, note that the h_{t,k} for 1 ≤ k ≤ m - 1 do not need to be stored explicitly during the forward propagation. They can instead be recalculated on the fly during BPTT, which only adds O(mn) operations at each time step: from Equation (13), the only ingredients needed to generate the intermediate hidden states at time step t are the reflection vectors {u_i} and the hidden state h_t. Having to generate the intermediate vectors locally does not affect the overall time complexity of the BPTT algorithm; adding O(mn) operations at each time step still results in an overall complexity of O(T mn) for one sequence of length T. However, unless one implements the BPTT algorithm from scratch, it would be challenging to modify a pre-existing library to manage memory in the way needed for our parametrisation.

Compact WY representation

To make the implementation of our method feasible, we derive expanded mathematical expressions for the following gradients with respect to some scalar loss function L:

∂L/∂U = [∂(W h)/∂U]ᵀ ∂L/∂(W h),  (14)
∂L/∂h = [∂(W h)/∂h]ᵀ ∂L/∂(W h),  (15)

where W is the transition matrix and U = (u_n ... u_{n-m+1}) is the matrix whose columns are the reflection vectors. Furthermore, we need to be able to compute these gradients in O(mn) time in order to achieve the global time and space complexities discussed in the previous subsections. Having explicit expressions for ∂L/∂U and ∂L/∂h allows an implementation that does not require storing any intermediate hidden states for BPTT. Note that in the case of an RNN, ∂L/∂U in Equation (14) represents the immediate gradient at a time step t; the hidden states {h_k}_{k<t} are affected by small perturbations of U only through the hidden state h_t.

To write out the gradients in Equations (14) and (15), we use the compact WY representation (Joffrain et al., 2006) of a product of Householder reflections, described by the following proposition.

Proposition 1 (Joffrain et al., 2006). Let n ∈ N and m ≤ n - 1. For n - m + 1 ≤ i ≤ n, let u_i ∈ R^i, and let U = (u_n ... u_{n-m+1}) ∈ R^{n×m} be the matrix whose columns are the vectors {u_i}, each padded with leading zeros to length n. Then

H_n(u_n) ⋯ H_{n-m+1}(u_{n-m+1}) = I_n - U T^{-1} Uᵀ,  (16)

where

T = striu(Uᵀ U) + (1/2) diag(Uᵀ U),  (17)

with striu(Uᵀ U) and diag(Uᵀ U) representing the strictly upper triangular part and the diagonal of the matrix Uᵀ U, respectively. This is the compact WY representation.

For the particular case of m = n, the compact WY representation can be written as

H_n(u_n) ⋯ H_1(u_1) = ( I_n - U T^{-1} Uᵀ ) H_1(u_1),  (18)

where U = (u_n ... u_2) and H_1(u_1) is defined in Equation (11).

Theorem 2. Let m ≤ n - 1 and W = I - U T^{-1} Uᵀ, where U = (u_n ... u_{n-m+1}). Let J_m ∈ R^{m×m} be the matrix of all ones, and let L be a scalar loss function. We have

∂L/∂U = U [ (h̃ C̃ᵀ) ∘ Bᵀ + (C̃ h̃ᵀ) ∘ B ] - C̄ h̃ᵀ - h C̃ᵀ,  (19)
∂L/∂h = C̄ - U C̃,  (20)

where C = W h, C̄ = ∂L/∂C, C̃ = T^{-T} Uᵀ C̄, h̃ = T^{-1} Uᵀ h, B = striu(J_m) + (1/2) I_m, and ∘ denotes the Hadamard product.

The gradients ∂L/∂U and ∂L/∂h as defined in Equations (19) and (20) can be calculated in O(mn) time and space. The proof of Equations (19) and (20) is provided in the appendix. Algorithms 2 and 3 show routines which calculate T^{-1} v and T^{-T} v in O(mn) time for any vector v, and Algorithm 4 shows a routine which calculates U [ (v wᵀ) ∘ Bᵀ + (w vᵀ) ∘ B ] for any vectors v and w, also in O(mn) time. Using these routines, the gradients ∂L/∂U and ∂L/∂h can be calculated in O(mn) time.

Algorithm 2. Calculating T^{-1} v for any vector v. For a matrix U ∈ R^{n×m}, U_{·,k} denotes its k-th column.
Require: U = (u_n ... u_{n-m+1}) ∈ R^{n×m} and v ∈ R^m.
Ensure: o = T^{-1} v, where o ∈ R^m.
  o ← [0, ..., 0]ᵀ ∈ R^m  {initialise o to zeros}
  t ← [0, ..., 0]ᵀ ∈ R^n  {temporary vector}
  for k = m to 1 do
    o_k ← 2 (v_k - U_{·,k}ᵀ t) / ‖U_{·,k}‖^2
    t ← t + o_k U_{·,k}
  end for

Algorithm 3. Calculating T^{-T} v for any vector v. For a matrix U ∈ R^{n×m}, U_{·,k} denotes its k-th column.
Require: U = (u_n ... u_{n-m+1}) ∈ R^{n×m} and v ∈ R^m.
Ensure: o = T^{-T} v, where o ∈ R^m.
  o ← [0, ..., 0]ᵀ ∈ R^m  {initialise o to zeros}
  t ← [0, ..., 0]ᵀ ∈ R^n  {temporary vector}
  for k = 1 to m do
    o_k ← 2 (v_k - U_{·,k}ᵀ t) / ‖U_{·,k}‖^2
    t ← t + o_k U_{·,k}
  end for
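Proposition 1 and Algorithm 2 can be checked numerically. The sketch below (illustrative; variable names are our own) builds unit-norm, zero-padded reflection vectors, forms T as in Equation (17), and verifies Equation (16) as well as the O(mn) triangular solve:

```python
import numpy as np

rng = np.random.default_rng(0)
n, m = 6, 4

# Unit-norm reflection vectors u_i in R^i for i = n, ..., n-m+1,
# zero-padded to length n so that H_i(u_i) = I - 2 u u^T.
U = np.zeros((n, m))
for j, i in enumerate(range(n, n - m, -1)):   # columns u_n, ..., u_{n-m+1}
    u = rng.standard_normal(i)
    U[n - i:, j] = u / np.linalg.norm(u)

# T = striu(U^T U) + (1/2) diag(U^T U)   (Equation (17)).
G = U.T @ U
T = np.triu(G, k=1) + 0.5 * np.diag(np.diag(G))

# Product of reflections, computed explicitly ...
P = np.eye(n)
for j in range(m - 1, -1, -1):    # apply H_{n-m+1} first, H_n last
    u = U[:, j]
    P = (np.eye(n) - 2.0 * np.outer(u, u)) @ P

# ... equals the compact WY form I - U T^{-1} U^T   (Equation (16)).
W = np.eye(n) - U @ np.linalg.solve(T, U.T)
assert np.allclose(P, W)

# Algorithm 2: solve T o = v in O(mn) without forming T.
def solve_T(U, v):
    m = U.shape[1]
    o = np.zeros(m)
    t = np.zeros(U.shape[0])
    for k in range(m - 1, -1, -1):            # back substitution
        o[k] = 2.0 * (v[k] - U[:, k] @ t) / (U[:, k] @ U[:, k])
        t += o[k] * U[:, k]
    return o

w = rng.standard_normal(m)
assert np.allclose(solve_T(U, w), np.linalg.solve(T, w))
```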
Algorithm 4. Calculating U [ (v wᵀ) ∘ Bᵀ + (w vᵀ) ∘ B ] for any vectors v and w in R^m. For a matrix U ∈ R^{n×m}, U_{·,k} denotes its k-th column.
Require: U = (u_n ... u_{n-m+1}) ∈ R^{n×m} and v, w ∈ R^m.
Ensure: Z = U [ (v wᵀ) ∘ Bᵀ + (w vᵀ) ∘ B ] ∈ R^{n×m}.
  Z ← 0_{n,m} ∈ R^{n×m}  {initialise Z with zeros}
  t ← [0, ..., 0]ᵀ ∈ R^n  {temporary vector}
  l ← [0, ..., 0]ᵀ ∈ R^n  {temporary vector}
  for k = 1 to m - 1 do
    t ← t + U_{·,k} v_k
    Z_{·,k+1} ← Z_{·,k+1} + w_{k+1} t
    l ← l + U_{·,m-k+1} w_{m-k+1}
    Z_{·,m-k+1} ← Z_{·,m-k+1} + v_{m-k+1} l
  end for
  l ← l + U_{·,1} w_1
  Z_{·,1} ← Z_{·,1} + v_1 l

5. Experiments

We implemented and tested our method using the Python library Theano (Theano Development Team, 2016). We implemented a tensor operator F which takes a matrix W and a vector h and returns the product W h. To enable automatic differentiation, we also implemented a gradient method associated with the operator F which calculates the gradients in Equations (19) and (20) using Algorithms 2, 3, and 4.

5.1. Scalability

In this first experiment we tested the scalability of our method by measuring the time required for one gradient calculation. The input consisted of one randomly generated two-dimensional sequence, and the target output was a fixed scalar value at the end of the sequence. We compared our method to a naive approach in which the orthogonal matrix is generated explicitly before the forward propagation using the compact representation W = I - U T^{-1} Uᵀ. Note that even though the matrix T is triangular and its inverse can be calculated in O(m^2) time, generating W would still require O(nm^2) operations because of the matrix-matrix products involved. Our method, on the other hand, does not require the explicit computation of W. We considered the following situations.

First, we fixed the length of the input sequence to T = 1000 and varied the number of hidden units n of the oRNN in the set {128, 256, 512, 1024, 2048} (see Figure 1(a)). The number of reflection vectors was equal to the hidden size (i.e. m = n). Our method scales as O(n^2), compared with O(n^3) for the naive approach, as expected, and is faster than the latter for n larger than roughly 250.

Second, we fixed the length of the input sequence to T = 1000 and the number of hidden units to n = 2048, and varied the number of reflections m in {128, 256, 512, 1024, 2048} (see Figure 1(b)). Our method scales as O(m).

The last case consisted of fixing the number of hidden units and reflections to n = m = 128 and varying the length of the input sequence in {64, 128, 256, 512, 1024, 2048, 4096}. The computational time increased linearly with the length of the sequence (see Figure 1(c)).

This first experiment confirms that, using our approach, BPTT can be performed in O(T mn) time for an input sequence of length T.

5.2. Application to problems with long-term dependencies

We tested the new parametrisation on three different datasets, all exhibiting long-term dependencies. We will refer to the RNN parametrised using Householder reflections as the oRNN. We set its activation function to the leaky Rectified Linear Unit (ReLU), defined as

φ(x) = x if x > 0, and 0.1 x otherwise.  (21)

For all experiments, we used the Adam method for stochastic gradient descent. We tested learning rates in the set {10^{-1}, 10^{-2}, 10^{-3}, 10^{-4}} for every model and selected the ones resulting in the fastest convergence on the training sets. We initialised all parameters using uniform distributions, similar to (Arjovsky et al., 2015). The biases of all models were set to zero, except for the forget bias of the LSTM, which we set to 5 in order to facilitate the learning of long-term dependencies (Koutník et al., 2014).
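A single oRNN time step under this setup can be sketched as follows. This is an illustrative NumPy version of Equation (1) with the parametrised W of Equation (12) and the leaky ReLU of Equation (21); the paper's actual Theano implementation differs:

```python
import numpy as np

def leaky_relu(x):                            # Equation (21)
    return np.where(x > 0, x, 0.1 * x)

def apply_W(U, h):
    """h -> H_n(u_n) ... H_{n-m+1}(u_{n-m+1}) h in O(nm).
    Column j of U holds u_{n-j}, zero-padded to length n."""
    for j in range(U.shape[1] - 1, -1, -1):   # rightmost reflection first
        u = U[:, j]
        h = h - 2.0 * u * (u @ h)             # (I - 2 u u^T) h
    return h

rng = np.random.default_rng(0)
n, m, nx = 8, 4, 2
# Unit-norm reflection vectors, zero-padded into the columns of U.
U = np.zeros((n, m))
for j, i in enumerate(range(n, n - m, -1)):
    u = rng.standard_normal(i)
    U[n - i:, j] = u / np.linalg.norm(u)
V = 0.1 * rng.standard_normal((n, nx))        # input-to-hidden matrix

# One oRNN step, Equation (1) with the parametrised W:
h = np.zeros(n)
x = rng.standard_normal(nx)
h = leaky_relu(apply_W(U, h) + V @ x)

# The implicit W is orthogonal, so the linear part of the update
# preserves the hidden-state norm exactly: ||W v|| = ||v||.
v = rng.standard_normal(n)
assert np.isclose(np.linalg.norm(apply_W(U, v)), np.linalg.norm(v))
```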
To ensure that the transition matrix of the oRNN is always orthogonal, we projected each reflection vector u_i back onto the unit sphere after each gradient step.

SEQUENCE GENERATION

In this experiment, we followed a setting similar to (Koutník et al., 2014), training RNNs to encode song excerpts. We used the track Manyrista from the album Musica Deposita by Cuprum. We extracted five consecutive excerpts around the beginning of the song, each consisting of 800 data points and corresponding to about 18 ms at a 44.1 kHz sampling frequency. We trained an sRNN, an LSTM,
and an oRNN for 5 epochs on each of the pieces, with five random seeds. For each run, the lowest Normalised Mean Squared Error (NMSE) during the 5 epochs was recorded. For each model, we tested three different hidden sizes, chosen so that the total number of parameters was approximately 250, 500, and 1000. For the oRNN, we chose the number of reflection vectors to be equal to the hidden size in each case, so that the transition matrix is allowed to span the full orthogonal group. The results are shown in Figures 2 and 3. All the learning rates were set to 0.1.

Figure 1. Scalability tests. (a) Time required for one gradient step when the input is one two-dimensional sequence of length T = 1000; the size of the hidden layer n takes values in {128, 256, 512, 1024, 2048}. (b) Time required for one gradient step when the size of the hidden layer and the length of the input sequence are fixed to n = 2048 and T = 1000 respectively; the number of reflections m takes values in {128, 256, 512, 1024, 2048}. (c) Time required for one gradient step when the size of the hidden layer is fixed to n = 128; the input is one two-dimensional sequence of length T ∈ {64, 128, 256, 512, 1024, 2048, 4096}.

The orthogonal parametrisation outperformed the simple RNN and performed on average better than the LSTM on the sequence encoding task.

PIXEL MNIST

We used the MNIST image dataset in this experiment, split into a training set (55000 instances), a validation set (5000 instances), and a test set (10000 instances). We trained an oRNN with n = 128 hidden units and m = 16 reflections, corresponding to approximately 3600 parameters, to minimise the cross-entropy.
We set the batch size to 64, and we selected the best model based on the lowest validation cost after 2 epochs; the latter was evaluated every 2 training iterations. All the learning rates were set to 0.1 in this experiment. Table 2 compares our test performance against the uRNN results available in the literature. The models used for comparison all have about 16K parameters, which is more than four times the number of parameters of our tested model. Despite this, our model still achieved a 95.6% test accuracy, which is the second best result.

Figure 3. Results summary for the sequence generation task. The plots show the NMSE distribution for the different models with respect to the total number of parameters. The horizontal red lines represent the medians of the NMSE over 25 data points (i.e. five seeds for each of the five song excerpts). The solid rectangles and the dashed bars represent the [25%, 75%] and [9%, 91%] confidence intervals respectively.
Figure 2. RNN-generated sequences against the true data for two of the five excerpts used. We display only the best performing models for a total number of parameters N_p ≈ 1000, corresponding to hidden sizes 15, 30, and 41 for the LSTM, sRNN, and oRNN respectively. The number of reflections was set to m = n for the oRNN. The sRNN showed the worst performance.

Table 2. Results summary for the MNIST digit classification experiment, and comparison with the uRNN results available in the literature. FC and RC stand for Full-Capacity and Restricted-Capacity respectively.

Model | hidden size (# reflections) | Number of parameters | validation accuracy | test accuracy
oRNN | 128 (m = 16) | ≈3.6K | 96.8% | 95.6%
uRNN (Arjovsky et al., 2015) | – | ≈16K | – | –
RC uRNN (Jia, 2016) | – | ≈16K | – | –
FC uRNN (Jia, 2016) | – | ≈16K | – | –

ADDING TASK

In this experiment, we follow a setting similar to (Arjovsky et al., 2015), where the goal of the RNN is to output the sum of two elements in the first dimension of a two-dimensional sequence. The locations of the two elements to be summed are specified by the entries in the second dimension of the input sequence. In particular, the first dimension of every input sequence consists of random numbers between 0 and 1, while the second dimension has all zeros except for two elements equal to 1. The first unit entry is located in the first half of the sequence, and the second one in the second half. We tested two different lags, T = 400 and T = 800. All models were trained to minimise the Mean Squared Error (MSE); the baseline MSE for this task is 0.167, corresponding to a model that always outputs 1.

We trained an oRNN with n = 128 hidden units and m = 16 reflections, as in the MNIST experiment, together with an LSTM and an sRNN with hidden sizes 28 and 54 respectively, giving all three models approximately the same total number of parameters (≈3600). We chose a batch size of 50, and after each iteration a new set of sequences was generated randomly.
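The data generation just described can be sketched as follows (an illustrative snippet; the array shapes and the helper name are our own choices, not from the paper):

```python
import numpy as np

def adding_batch(batch_size, T, rng):
    """One batch for the adding task: inputs of shape (batch_size, T, 2)
    and targets of shape (batch_size,)."""
    x = np.zeros((batch_size, T, 2))
    # First dimension: random numbers in [0, 1).
    x[:, :, 0] = rng.uniform(0.0, 1.0, size=(batch_size, T))
    # Second dimension: one marker in each half of the sequence.
    i = rng.integers(0, T // 2, size=batch_size)
    j = rng.integers(T // 2, T, size=batch_size)
    rows = np.arange(batch_size)
    x[rows, i, 1] = 1.0
    x[rows, j, 1] = 1.0
    # Target: the sum of the two marked elements of the first dimension.
    y = x[rows, i, 0] + x[rows, j, 0]
    return x, y

rng = np.random.default_rng(0)
x, y = adding_batch(50, 400, rng)
# Always predicting the mean target 1 gives MSE = Var(y) = 1/6 ≈ 0.167,
# which is the baseline quoted in the text.
```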
The learning rate for the ornn was set to 0.1. Figures 5(a) and 5(b) display the results. The ornn was able to beat the baseline MSE in less than 5000 iterations for both lags and for two different random initialisation seeds. This is in line with the results of the unitary RNN (Arjovsky et al., 2015).

Figure 4. Convergence plots of the MNIST classification experiment. (a) Validation accuracy of the ornn with n = 128 and m = 16 as a function of the number of epochs. (b) Cross-entropy on the validation set for the same model as a function of the number of epochs.

Figure 5. Results of the addition task. (a) T = 400; (b) T = 800. For each lag T, the red curves represent two runs of the ornn model with two different random initialisation seeds. Notice that the LSTM and srnn did not beat the baseline MSE of 0.167.

6. Discussion

In this work, we presented a new efficient way to parametrise the transition matrices of recurrent neural networks using Householder reflections. This method provides an easy and computationally efficient way to enforce an orthogonal constraint on the transition matrix, which in turn ensures that exploding gradients do not occur during training. We have also shown that enforcing a unitary constraint on the transition matrix is a special case of the orthogonal one. Experimental results show that the orthogonal parametrisation has similar advantages to the unitary one.

Acknowledgment

The authors would like to acknowledge the Department of State Growth Tasmania for funding this work through SenseT.

References

Arjovsky, Martin, Shah, Amar, and Bengio, Yoshua. Unitary evolution recurrent neural networks. arXiv preprint arXiv:1511.06464, 2015.

Giles, Mike B. An extended collection of matrix derivative results for forward and reverse mode automatic differentiation. 2008.

Glorot, Xavier and Bengio, Yoshua. Understanding the difficulty of training deep feedforward neural networks. In AISTATS, volume 9, pp. 249-256, 2010.

Henaff, Mikael, Szlam, Arthur, and LeCun, Yann. Orthogonal RNNs and long-memory tasks. arXiv preprint, 2016.

Hochreiter, Sepp and Schmidhuber, Jürgen. Long Short-Term Memory. Neural Computation, 9(8):1735-1780, 1997.

Hochreiter, Sepp, Bengio, Yoshua, Frasconi, Paolo, and Schmidhuber, Jürgen. Gradient flow in recurrent nets: the difficulty of learning long-term dependencies, 2001.

Hyland, Stephanie L and Rätsch, Gunnar. Learning unitary operators with help from u(n). arXiv preprint, 2016.

Jia, Kui. Improving training of deep neural networks via singular value bounding. arXiv preprint, 2016.

Joffrain, Thierry, Low, Tze Meng, Quintana-Ortí, Enrique S, van de Geijn, Robert, and Van Zee, Field G. Accumulating Householder transformations, revisited. ACM Transactions on Mathematical Software (TOMS), 32(2):169-179, 2006.

Koutník, Jan, Greff, Klaus, Gomez, Faustino J., and Schmidhuber, Jürgen. A clockwork RNN. CoRR, abs/1402.3511, 2014.

Le, Quoc V, Jaitly, Navdeep, and Hinton, Geoffrey E. A simple way to initialize recurrent networks of rectified linear units. arXiv preprint arXiv:1504.00941, 2015.

Pascanu, Razvan, Mikolov, Tomas, and Bengio, Yoshua. On the difficulty of training recurrent neural networks. In ICML, 2013.

Theano Development Team. Theano: A Python framework for fast computation of mathematical expressions. arXiv e-prints, abs/1605.02688, May 2016.

Wisdom, Scott, Powers, Thomas, Hershey, John R, Le Roux, Jonathan, and Atlas, Les. Full-capacity unitary recurrent neural networks. arXiv preprint arXiv:1611.00035, 2016.

A. Proof of Theorem 2

Lemma 1. (Giles, 2008) Let A, B, and C be real or complex matrices such that C = f(A, B), where f is some differentiable mapping, and let s be a scalar quantity that depends on A and B through C. Then we have the identity

Tr(\bar{C}^T dC) = Tr(\bar{A}^T dA) + Tr(\bar{B}^T dB),

where dA, dB, and dC represent infinitesimal perturbations and \bar{X} = \partial s / \partial X, for X = A, B, C.
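Lemma 1 can be sanity-checked numerically. The sketch below is an illustration of ours, not code from the paper: it takes f(A, B) = AB and s = (1/2) ||C||_F^2, for which \bar{C} = C and the trace identity implies \bar{A} = \bar{C} B^T and \bar{B} = A^T \bar{C}; the analytic expression is then compared against a finite-difference directional derivative.

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((4, 3))
B = rng.standard_normal((3, 5))

def s(A, B):
    # Scalar quantity depending on A and B through C = f(A, B) = A @ B.
    return 0.5 * np.sum((A @ B) ** 2)

C_bar = A @ B            # ds/dC for s = 0.5 * ||C||_F^2
A_bar = C_bar @ B.T      # from Tr(C_bar^T dC) = Tr(A_bar^T dA) + Tr(B_bar^T dB)
B_bar = A.T @ C_bar

# Finite-difference directional derivative along a random direction (dA, dB).
dA = rng.standard_normal(A.shape)
dB = rng.standard_normal(B.shape)
eps = 1e-6
numeric = (s(A + eps * dA, B + eps * dB) - s(A - eps * dA, B - eps * dB)) / (2 * eps)
analytic = np.trace(A_bar.T @ dA) + np.trace(B_bar.T @ dB)
assert abs(numeric - analytic) < 1e-5 * max(1.0, abs(analytic))
```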
Proof of Theorem 2. Let C = h - U T^{-1} U^T h, where (U, h) \in R^{n \times m} \times R^n and T = striu(U^T U) + \frac{1}{2} diag(U^T U). Notice that the matrix T can be written using the Hadamard product as

T = B \circ (U^T U),

where B = striu(J_m) + \frac{1}{2} I_m and J_m is the matrix of all ones. Calculating the infinitesimal perturbation of C gives

dC = (I - U T^{-1} U^T) dh - dU T^{-1} U^T h - U T^{-1} dU^T h + U T^{-1} dT T^{-1} U^T h.

From T = B \circ (U^T U) we can write

dT = B \circ (dU^T U + U^T dU).

By substituting this back into the expression of dC, multiplying on the left by \bar{C}^T, and applying the trace, we get

Tr(\bar{C}^T dC) = Tr(\bar{C}^T (I - U T^{-1} U^T) dh) - Tr(\bar{C}^T dU T^{-1} U^T h) - Tr(\bar{C}^T U T^{-1} dU^T h) + Tr(\bar{C}^T U T^{-1} (B \circ (dU^T U + U^T dU)) T^{-1} U^T h).

Now using the identity Tr(AB) = Tr(BA), where the second dimension of A agrees with the first dimension of B, we can rearrange the expression of Tr(\bar{C}^T dC) as follows:

Tr(\bar{C}^T dC) = Tr(\bar{C}^T (I - U T^{-1} U^T) dh) - Tr(T^{-1} U^T h \bar{C}^T dU) - Tr(h \bar{C}^T U T^{-1} dU^T) + Tr(T^{-1} U^T h \bar{C}^T U T^{-1} (B \circ (dU^T U + U^T dU))).

To simplify the expression, we will use the short notations

\tilde{C} = T^{-T} U^T \bar{C},    \tilde{h} = T^{-1} U^T h,

so that Tr(\bar{C}^T dC) becomes

Tr(\bar{C}^T dC) = Tr((\bar{C}^T - \tilde{C}^T U^T) dh) - Tr(\tilde{h} \bar{C}^T dU) - Tr(h \tilde{C}^T dU^T) + Tr(\tilde{h} \tilde{C}^T (B \circ (dU^T U + U^T dU))).

Now using the two following identities of the trace,

Tr(A^T) = Tr(A),    Tr(A (B \circ C)) = Tr((A \circ B^T) C),

we can rewrite Tr(\bar{C}^T dC) as follows:

Tr(\bar{C}^T dC) = Tr((\bar{C}^T - \tilde{C}^T U^T) dh) - Tr(\tilde{h} \bar{C}^T dU) - Tr(h \tilde{C}^T dU^T) + Tr(((\tilde{h} \tilde{C}^T) \circ B^T) dU^T U) + Tr(((\tilde{h} \tilde{C}^T) \circ B^T) U^T dU).

By rearranging and taking the transpose of the third and fourth terms of the right-hand side we obtain

Tr(\bar{C}^T dC) = Tr((\bar{C}^T - \tilde{C}^T U^T) dh) - Tr(\tilde{h} \bar{C}^T dU) - Tr(\tilde{C} h^T dU) + Tr(((\tilde{C} \tilde{h}^T) \circ B) U^T dU) + Tr(((\tilde{h} \tilde{C}^T) \circ B^T) U^T dU).

Factorising by dU inside the trace we get

Tr(\bar{C}^T dC) = Tr((\bar{C}^T - \tilde{C}^T U^T) dh) - Tr([ \tilde{h} \bar{C}^T + \tilde{C} h^T - ((\tilde{C} \tilde{h}^T) \circ B + (\tilde{h} \tilde{C}^T) \circ B^T) U^T ] dU).

Using Lemma 1 we conclude that

\bar{U} = U [ (\tilde{h} \tilde{C}^T) \circ B^T + (\tilde{C} \tilde{h}^T) \circ B ] - \bar{C} \tilde{h}^T - h \tilde{C}^T,
\bar{h} = \bar{C} - U \tilde{C}.
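The object manipulated in this proof, C = h - U T^{-1} U^T h, is the compact accumulated form (Joffrain et al., 2006) of applying the Householder reflections H_1 ... H_m to h. A small numerical sanity check of this identity, and of the orthogonality of the resulting map, is sketched below (an illustration of ours; all variable names are hypothetical):

```python
import numpy as np

rng = np.random.default_rng(0)
n, m = 8, 3
U = rng.standard_normal((n, m))   # columns u_1, ..., u_m: reflection vectors
h = rng.standard_normal(n)

# T = striu(U^T U) + (1/2) diag(U^T U)
UtU = U.T @ U
T = np.triu(UtU, k=1) + 0.5 * np.diag(np.diag(UtU))

# Compact form: C = h - U T^{-1} U^T h
C = h - U @ np.linalg.solve(T, U.T @ h)

# Sequential form: C = H_1 H_2 ... H_m h, so H_m is applied to h first
v = h.copy()
for k in reversed(range(m)):
    u = U[:, k]
    v = v - 2.0 * u * (u @ v) / (u @ u)   # H_k v = v - 2 u (u^T v) / ||u||^2
assert np.allclose(C, v)

# The map h -> C is orthogonal: W = I - U T^{-1} U^T satisfies W^T W = I
W = np.eye(n) - U @ np.linalg.solve(T, U.T)
assert np.allclose(W.T @ W, np.eye(n))
```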
More information