arxiv: v3 [cs.lg] 9 Feb 2016

Size: px

Start display at page:

Download "arxiv: v3 [cs.lg] 9 Feb 2016"

Ralf Ball
6 years ago
Views:

1 NEURAL RANDOM-ACCESS MACHINES Karol Kurach & Marcin Andrychowicz & Ilya Sutskever Google ABSTRACT arxiv: v3 [cs.lg] 9 Feb 2016 In this aer, we roose and investigate a new neural network architecture called Neural Random Access Machine. It can maniulate and dereference ointers to an external variable-size random-access memory. The model is trained from ure inut-outut examles using backroagation. We evaluate the new model on a number of simle algorithmic tasks whose solutions require ointer maniulation and dereferencing. Our results show that the roosed model can learn to solve algorithmic tasks of such tye and is caable of oerating on simle data structures like linked-lists and binary trees. For easier tasks, the learned solutions generalize to sequences of arbitrary length. Moreover, memory access during inference can be done in a constant time under some assumtions. 1 INTRODUCTION Dee learning is successful for two reasons. First, dee neural networks are able to reresent the right kind of functions; second, dee neural networks are trainable. Dee neural networks can be otentially imroved if they get deeer and have fewer arameters, while maintaining trainability. By doing so, we move closer towards a ractical imlementation of Solomonoff induction (Solomonoff, 1964). The first model that we know of that attemted to train extremely dee networks with a large memory and few arameters is the Neural Turing Machine (NTM) (Graves et al., 2014) a comutationally universal dee neural network that is trainable with backroagation. Other models with this roerty include variants of Stack-Augmented recurrent neural networks (Joulin & Mikolov, 2015; Grefenstette et al., 2015), and the Grid-LSTM (Kalchbrenner et al., 2015) of which the Grid-LSTM has achieved the greatest success on both synthetic and real tasks. The key characteristic of these models is that their deth, the size of their short term memory, and their number of arameters are no longer confounded and can be altered indeendently which stands in contrast to models like the LSTM (Hochreiter & Schmidhuber, 1997), whose number of arameters grows quadratically with the size of their short term memory. A fundamental oeration of modern comuters is ointer maniulation and dereferencing. In this work, we investigate a model class that we name the Neural Random-Access Machine (NRAM), which is a neural network that has, as rimitive oerations, the ability to maniulate, store in memory, and dereference ointers into its working memory. By roviding our model with dereferencing as a rimitive, it becomes ossible to train models on roblems whose solutions require ointer maniulation and chasing. Although all comutationally universal neural networks are equivalent, which means that the NRAM model does not have a reresentational advantage over other models if they are given a sufficient number of comutational stes, in ractice, the number of timestes that a given model has is highly limited, as extremely dee models are very difficult to train. As a result, the model s core rimitives have a strong effect on the set of functions that can be feasibly learned in ractice, similarly to the way in which the choice of a rogramming language strongly affects the functions that can be imlemented with an extremely small amount of code. Finally, the usefulness of comutationally-universal neural networks deends entirely on the ability of backroagation to find good settings of their arameters. Indeed, it is trivial to define the otimal hyothesis class (Solomonoff, 1964), but the roblem of finding the best (or even a good) Equal contribution. 1

2 function in that class is intractable. Our work uts the backroagation algorithm to another test, where the model is extremely dee and intricate. In our exeriments, we evaluate our model on several algorithmic roblems whose solutions required ointer maniulation and chasing. These roblems include algorithms on a linked-list and a binary tree. While we were able to achieve encouraging results on these roblems, we found that standard otimization algorithms struggle with these extremely dee and nonlinear models. We believe that advances in otimization methods will likely lead to better results. 2 RELATED WORK There has been a significant interest in the roblem of learning algorithms in the ast few years. The most relevant recent aer is Neural Turing Machines (NTMs) (Graves et al., 2014). It was the first aer to exlicitly suggest the notion that it is worth training a comutationally universal neural network, and achieved encouraging results. A follow-u model that had the goal of learning algorithms was the Stack-Augmented Recurrent Neural Network (Joulin & Mikolov, 2015) This work demonstrated that the Stack-Augmented RNN can generalize to long roblem instances from short roblem instances. A related model is the Reinforcement Learning Neural Turing Machine (Zaremba & Sutskever, 2015), which attemted to use reinforcement learning techniques to train a discrete-continuous hybrid model. The memory network (Weston et al., 2014) is an early model that attemted to exlicitly searate the memory from comutation in a neural network model. The followu work of Sukhbaatar et al. (2015) combined the memory network with the soft attention mechanism, which allowed it to be trained with less suervision. The Grid-LSTM (Kalchbrenner et al., 2015) is a highly interesting extension of LSTM, which allows to use LSTM cells for both dee and sequential comutation. It achieves excellent results on both synthetic, algorithmic roblems and on real tasks, such as language modelling, machine translation, and object recognition. The Pointer Network (Vinyals et al., 2015) is somewhat different from the above models in that it does not have a writable memory it is more similar to the attention model of Bahdanau et al. (2014) in this regard. Desite not having a memory, this model was able to solve a number of difficult algorithmic roblems that include the convex hull and the aroximate 2D travelling salesman roblem (TSP). Finally, it is imortant to mention the attention model of Bahdanau et al. (2014). Although this work is not exlicitly aimed at learning algorithms, it is by far the most ractical model that has an algorithmic bent. Indeed, this model has roven to be highly versatile, and variants of this model have achieved state-of-the-art results on machine translation (Luong et al., 2015), seech recognition (Chan et al., 2015), and syntactic arsing (Vinyals et al., 2014), without the use of almost any domain-secific tuning. 3 MODEL In this section we describe the NRAM model. We start with a descrition of the simlified version of our model which does not use an external memory and then exlain how to augment it with a variable-size random-access memory. The core art of the model is a neural controller, which acts as a rocessor. The controller can be a feedforward neural network or an LSTM, and it is the only trainable art of the model. The model contains R registers, each of which holds an integer value. To make our model trainable with gradient descent, we made it fully differentiable. Hence, each register reresents an integer value with a distribution over the set {0, 1,..., M 1}, for some constant M. We do not assume that these distributions have any secial form they are simly stored as vectors R M satisfying i 0 and i i = 1. The controller does not have direct access to the registers; it can interact with them using a number of resecified modules (gates), such as integer addition or equality test. 2

3 Let s denote the modules m 1, m 2,..., m Q, where each module is a function: m i : {0, 1,..., M 1} {0, 1,..., M 1} {0, 1,..., M 1}. On a high level, the model erforms a sequence of timestes, each of which consists of the following substes: 1. The controller gets some inuts deending on the values of the registers (the controller s inuts are described in Sec. 3.1). 2. The controller udates its internal state (if the controller is an LSTM). 3. The controller oututs the descrition of a fuzzy circuit with inuts r 1,..., r R, gates m 1,..., m Q and R oututs. 4. The values of the registers are overwritten with the oututs of the circuit. More recisely, each circuit is created as follows. The inuts for the module m i are chosen by the controller from the set {r 1,..., r R, o 1,..., o i 1 }, where: r j is the value stored in the j-th register at the current timeste, and o j is the outut of the module m j at the current timeste. Hence, for each 1 i Q the controller chooses weighted averages of the values {r 1,..., r R, o 1,..., o i 1 } which are given as inuts to the module. Therefore, o i = m i ( (r1,..., r R, o 1,..., o i 1 ) T softmax(a i ), (r 1,..., r R, o 1,..., o i 1 ) T softmax(b i ) ), (1) where the vectors a i, b i R R+i 1 are roduced by the controller (Fig. 1). registers oututs of revious modules, r 1... r R o 1... o i 1 LSTM a i b i s-m s-m, m i o i Figure 1: The execution of the module m i. Gates s-m reresent the softmax function and, denotes inner roduct. See Eq. 1 for details. Recall that the variables r j reresent robability distributions and therefore the inuts to m i, being weighted averages of robability distributions, are also robability distributions. Thus, as the modules m i are originally defined for integer inuts and oututs, we must extend their domain to robability distributions as inuts, which can be done in a natural way (and make their outut also be a robability distribution): 0 c<m P (m i (A, B) = c) = P(A = a)p(b = b)[m i (a, b) = c]. (2) 0 a,b<m After the modules have roduced their oututs, the controller decides which of the values {r 1,..., r R, o 1,..., o Q } should be stored in the registers. In detail, the controller oututs the vectors c i R R+Q for 1 i R and the values of the registers are udated (simultaneously) using the formula: r i := (r 1,..., r R, o 1,..., o Q ) T softmax(c i ). (3) 3

4 3.1 CONTROLLER S INPUTS Recall that at the beginning of each timeste the controller receives some inuts, and it is an imortant design decision to decide where should these inuts come from. A naive aroach is to use the values of the registers as inuts to the controller. However, the values of the registers are robability distributions and are stored as vectors R M. If the entire distributions were given as inuts to the controller then the number of the model s arameters would deend on M. This would be undesirable because, as will be exlained in the next section, the value M is linked to the size of an external random-access memory tae and hence it would revent the model from generalizing to different memory sizes. Hence, for each 1 i R the controller receives, as inut, only one scalar from each register, namely P(r i = 0) the robability that the value in the register is equal 0. This solution has an additional advantage, namely it limits the amount of information available to the controller and forces it to rely on the modules instead of trying to solve the roblem on its own. Notice that this information is sufficient to get the exact value of r i if r i {0, 1}, which is the case whenever r i is an outut of a,,boolean module, e.g. the inequality test module m i (a, b) = [a < b]. 3.2 MEMORY TAPE One could use the model described so far for learning sequence-to-sequence transformations by initializing the registers with the inut sequence, and training the model to roduce the desired outut sequence in its registers after a given number of timestes. The disadvantage of such model is that it would be comletely unable to generalize to longer sequences, because the length of the sequence that the model can rocess is equal to the number of its registers, which is constant. Therefore, we extend the model with a variable-size memory tae, which consists of M memory cells, each of which stores a distribution over the set {0, 1,..., M 1}. Notice that each distribution stored in a memory cell or a register can be interreted as a fuzzy address in the memory and used as a fuzzy ointer. We will hence identify integers in the set {0, 1,..., M 1} with ointers to the memory. Therefore, the value in each memory cell may be interreted as an integer or as a ointer. The exact state of the memory can be described by a matrix M R M M, where the value M i,j is the robability that the i-th cell holds the value j. The model interacts with the memory tae solely using two secial modules: READ module: this module takes as the inut a ointer 1 and returns the value stored under the given address in the memory. This oeration is extended to fuzzy ointers similarly to Eq. 2. More recisely, if is a vector reresenting the robability distribution of the inut (i.e. i is the robability that the inut ointer oints to the i-th cell) then the module returns the value M T. WRITE module: this module takes as the inut a ointer and a value a and stores the value a under the address in the memory. The fuzzy form of the oeration can be effectively exressed using matrix oerations 2. The full architecture of the NRAM model is resented on Fig INPUTS AND OUTPUTS HANDLING The memory tae also serves as an inut-outut channel the model s memory is initialized with the inut sequence and the model is exected to roduce the outut in the memory. Moreover, we use a novel way of deciding how many timestes should be executed. After each timeste we let the controller decide whether it would like to continue the execution or finish it, in which case the current state of the memory is treated as the outut. 1 Formally each module takes two arguments. In this case the second argument is simly ignored. 2 The exact formula is M := (J )J T M + a T, where J denotes a (column) vector consisting of M ones and denotes coordinate-wise multilication. 4

5 binarized LSTM finish? registers r 1 r 2 r 3 r 4 m 1 m 2 m 3 r 1 r 2 r 3 r 4 memory tae Figure 2: One timeste of the NRAM architecture with R = 4 registers. The LSTM controller gets the,,binarized values r 1, r 2,... stored in the registers as inuts and oututs the descrition of the circuit in the grey box and the robability of finishing the execution in the current timeste (See Sec. 3.3 for more detail). The weights of the solid thin connections are oututted by the controller. The weights of the solid thick connections are trainable arameters of the model. Some of the modules (i.e. READ and WRITE) may interact with the memory tae (dashed connections). More recisely, after the timeste t the controller oututs a scalar f t [0, 1] 3, which denotes the willingness to finish the execution in the current timeste. Therefore, the robability that the execution has not been finished before the timeste t is equal t 1 i=1 (1 f i), and the robability that the outut is roduced exactly at the timeste t is equal t = f t t 1 i=1 (1 f i). There is also some maximal allowed number of timestes T, which is a hyerarameter. The model is forced to roduce outut in the last ste if it has not done it yet, i.e. T = 1 T 1 i=1 i regardless of the value f T. Let M (t) R M M denote the memory matrix after the timeste t, i.e. M(t) i,j is the robability that the i-th memory cell holds the value j after the timeste t. For an inut-outut air (x, y), where x, y {0, 1,..., M 1} M we define the loss of the model as the exected negative log-likelihood of roducing the correct outut, i.e., ( T t=1 t M ) i=1 log(m(t) i,y i ) assuming that the memory was initialized with the sequence x 4. Moreover, for all roblems we consider the outut sequence is shorter than the memory. Therefore, we comute the loss only over memory cells, which should contain the outut. 3.4 DISCRETIZATION Comuting the oututs of modules, reresented as robability distributions, is a comutationally costly oeration. For examle, comuting the outut of the READ module takes Θ(M 2 ) time as it requires the multilication of the matrix M R M M and the vector RM. One may however susect (and we emirically verify this claim in Sec. 4) that the NRAM model naturally learns solutions in which the distributions of intermediate values have very low entroy. The argument for this hyothesis is that fuzziness in the intermediate values would robably roagate to the outut and cause a higher value of the cost function. To test this hyothesis we trained the model and then used its discretized version during interference. In the discretized version every module gets as inuts the values from modules (or registers), which are the most robable to roduce 3 In fact, the controller oututs a scalar x i and f i = sigmoid(x i). 4 One could also use the negative log-likelihood of the exected outut, i.e. ( M i=1 log T ) t=1 t M(t) i,y i as the loss function. 5

6 the given inut accordingly to the distribution oututted by the controller. More recisely, it corresonds to relacing the function softmax in equations (1,3) with the function returning the vector containing 1 on the osition of the maximum value in the inut and zeros on all other ositions. Notice that in the discretized NRAM model each register and memory cell stores an integer from the set {0, 1,..., M 1} and therefore all modules may be executed efficiently (assuming that the functions reresented by the modules can be efficiently comuted). In case of a feedforward controller and a small (e.g. 20) number of registers the interference can be accelerated even further. Recall that the only inuts to the controller are binarized values of the register. Therefore, instead of executing the controller one may simle recomute the (discretized) controller s outut for each configuration of the registers binarized values. Such algorithm would enjoy an extremely efficient imlementation in machine code. 4 EXPERIMENTS 4.1 TRAINING PROCEDURE The NRAM model is fully differentiable and we trained it using the Adam otimization algorithm (Kingma & Ba, 2014) with the negative log-likelihood cost function. Notice that we do not use any additional suervised data (such as memory access traces) beyond ure inut-outut examles. We used multilayer ercetrons (MLPs) with two hidden layers or LSTMs with a hidden layer between inut and LSTM cells as controllers. The number of hidden units in each layer was equal. The ReLu nonlinearity (Nair & Hinton, 2010) was used in all exeriments. Below are some imortant techniques that we used in the training: Curriculum learning As noticed in several aers (Bengio et al., 2009; Zaremba & Sutskever, 2014), curriculum learning is crucial for training dee networks on very comlicated roblems. We followed the curriculum learning schedule from Zaremba & Sutskever (2014) without any modifications. The details can be found in Aendix B. Gradient cliing Notice that the deth of the unfolded execution is roughly a roduct of the number of timestes and the number of modules. Even for moderately small exeriments (e.g. 14 modules and 20 timestes) this value easily exceeds a few hundreds. In networks of such deth, the gradients can often exlode (Bengio et al., 1994), what makes training by backroagation much harder. We noticed that the gradients w.r.t. the intermediate values inside the backroagation were so large, that they sometimes led to an overflow in single-recision floating-oint arithmetic. Therefore, we clied the gradients w.r.t. the activations, within the execution of the backroagation algorithm. More recisely, each coordinate is searately croed into the range [ C 1, C 1 ] for some constant C 1. Before udating arameters, we also globally rescale the whole gradient vector, so that its L2 norm is not bigger than some constant value C 2. Noise We added random Gaussian noise to the comuted gradients after the backroagation ste. The variance of this noise decays exonentially during the training. The details can be found in Neelakantan et al. (2015). Enforcing Distribution Constraints For very dee networks, a small error in one lace can roagate to a huge error in some other lace. This was the case with our ointers: they are robability distributions over memory cells and they should sum u to 1. However, after a number of oerations are alied, they can accumulate error as a result of inaccurate floating-oint arithmetic. We have a secial layer which is resonsible for rescaling all values (multilying by the inverse of their sum), to make sure they always reresent a robability distribution. We add this layer to our model in a few critical laces (eg. after the softmax oeration) 5. 5 We do not however backroagate through these renormalizing oerations, i.e. during the backward ass we simly assume that they are identities. 6

7 Entroy While searching for a solution, the network can fix the ointer distribution on some articular value. This is advantageous at the end of training, because ideally we would like to be able to discretize the model. However, if this haens at the begin of the training, it could force the network to stay in a local minimum, with a small chance of moving the robability mass to some other value. To address this roblem, we encourage the network to exlore the sace of solutions by adding an entroy bonus, that decreases over time. More recisely, for every distribution oututted by the controller, we subtract from the cost function the entroy of the distribution multilied by some coefficient, which decreases exonentially during the training. Limiting the values of logarithms There are two laces in our model where the logarithms are comuted in the cost function and in the entroy comutation. Inuts to whose logarithms can be very small numbers, which may cause very big values of the cost function or even overflows in floating-oint arithmetic. To revent this henomenon we use log(max(x, ɛ)) instead of log(x) for some small hyerarameter ɛ whenever a logarithm is comuted. 4.2 TASKS We now describe the tasks used in our exeriments. For every task, the inut is given to the network in the memory tae, and the network s goal is to modify the memory according to the task s secification. We allow the network to modify the original inut. The final error for a test case is comuted as c m, where c is the number of correctly written cells, and m reresents the total number of cells that should be modified. Due to limited sace, we describe the tasks only briefly here. The detailed memory layout of inuts and oututs can be found in the Aendix A. 1. Access Given a value k and an array A, return A[k]. 2. Increment Given an array, increment all its elements by Coy Given an array and a ointer to the destination, coy all elements from the array to the given location. 4. Reverse Given an array and a ointer to the destination, coy all elements from the array in reversed order. 5. Swa Given two ointers, q and an array A, swa elements A[] and A[q]. 6. Permutation Given two arrays of n elements: P (contains a ermutation of numbers 1,..., n) and A (contains random elements), ermutate A according to P. 7. ListK Given a ointer to the head of a linked list and a number k, find the value of the k-th element on the list. 8. ListSearch Given a ointer to the head of a linked list and a value v to find return a ointer to the first node on the list with the value v. 9. Merge Given ointers to 2 sorted arrays A and B, merge them. 10. WalkBST Given a ointer to the root of a Binary Search Tree, and a ath to be traversed (sequence of left/right stes), return the element at the end of the ath. 4.3 MODULES In all of our exeriments we used the same sequence of 14 modules: READ (described in Sec. 3.2), ZERO(a, b) = 0, ONE(a, b) = 1, TWO(a, b) = 2, INC(a, b) = (a+1) mod M, ADD(a, b) = (a+b) mod M, SUB(a, b) = (a b) mod M, DEC(a, b) = (a 1) mod M, LESS-THAN(a, b) = [a < b], LESS-OR-EQUAL-THAN(a, b) = [a b], EQUALITY-TEST(a, b) = [a = b], MIN(a, b) = min(a, b), MAX(a, b) = max(a, b), WRITE (described in Sec. 3.2). We also considered settings in which the module sequence is reeated many times, e.g. there are 28 modules, where modules number 1. and 15. are READ, modules number 2. and 16. are ZERO and so on. The number of reetitions is a hyerarameter. 7

8 Task Train Comlexity Train error Generalization Discretization Access len(a) 20 0 erfect erfect Increment len(a) 15 0 erfect erfect Coy len(a) 15 0 erfect erfect Reverse len(a) 15 0 erfect erfect Swa len(a) 20 0 erfect erfect Permutation len(a) 6 0 almost erfect erfect ListK len(list) 10 0 strong hurts erformance ListSearch len(list) 6 0 weak hurts erformance Merge len(a) + len(b) 10 1% weak hurts erformance WalkBST size(tree) % strong hurts erformance Table 1: Results of the exeriments. The erfect generalization error means that the tested roblem had error 0 for comlexity u to 50. Exact generalization errors are resented in Fig. 3 The erfect discretization means that the discretized version of the model roduced exactly the same oututs as the original model on all test cases Merge WalkBST ListK ListSearch Permutation Test error Max task comlexity Figure 3: Generalization errors for hard tasks. The Permutation and ListSearch roblems were trained only u to comlexity 6. The remaining roblems were trained u to comlexity 10. The horizontal axis denotes the maximal task comlexity, i.e., x = 20 denotes results with comlexity samled uniformly from the interval [1, 20]. 4.4 RESULTS Overall, we were able to find arameters that achieved an error 0 for all roblems excet Merge and WalkBST (where we got an error of 1%). As described in 4.2, our metric is an accuracy on the memory cells that should be modified. To comute it, we take the continuous memory state roduced by our network, then discretize it (every cell will contain the value with the highest robability), and finally comare with the exected outut. The results of the exeriments are summarized in Table 1. Below we describe our results on all 10 tasks in more detail. We divide them into 2 categories: easy and hard tasks. Easy tasks is a category of tasks that achieved low error scores for many sets of arameters and we did not have to send much time trying to tune them. First 5 roblems from our task list belong to this category. Hard tasks, on the other hand, are roblems that often trained to low error rate only in a very small number of cases, eg. 1 out of EASY TASKS This category includes the following roblems: Access, Increment, Coy, Reverse, Swa. For all of them we were able to find many sets of hyerarameters that achieved error 0, or close to it without much effort. 8

9 Ste r 1 r 2 r 3 r 4 READ WRITE :0 :0 a: :1 :6 a: :1 :6 a: :2 :7 a: :2 :7 a: :3 :8 a: :3 :8 a: :4 :9 a: :4 :9 a: :5 :10 a: :5 :10 a:9 Table 2: State of memory and registers for the Coy roblem at the start of every timeste. We also show the arguments given to the READ and WRITE functions in each timeste. The argument : reresents the source/destination address and a: reresents the value to be written (for WRITE). The value 6 at osition 0 in the memory is the ointer to the destination array. It is followed by 5 values (gray columns) that should be coied. We also tested how those solutions generalize to longer inut sequences. To do this, for every roblem we selected a model that achieved error 0 during the training, and tested it on inuts with lengths u to To erform these tests we also increased the memory size and the number of allowed timestes. In all cases the model solved the roblem erfectly, what shows that it generalizes not only to longer inut sequences, but also to different memory sizes and numbers of allowed timestes. Moreover, the discretized version of the model (see Sec. 3.4 for details) also solves all the roblems erfectly. These results show that the NRAM model naturally learns algorithmic solutions, which generalize well. We were also interested if the found solutions generalize to sequences of arbitrary length. It is easiest to verify in the case of a discretized model with a feedforward controller. That is because then circuits oututted by the controller deend solely on the values of registers, which are integers. We manually analysed circuits for roblems Coy and Increment and verified that found solutions generalize to inuts of arbitrary length, assuming that the number of allowed timestes is aroriate HARD TASKS This category includes: Permutation, ListK, ListSearch, Merge and WalkBST. For all of them we had to erform an extensive random search to find a good set of hyerarameters. Usually, most of the arameter combinations were stuck on the starting curriculum level with a high error of 50% 70%. For the first 3 tasks we managed to train the network to achieve error 0. For WalkBST and Merge the training errors were 0.3% and 1% resectively. For training those roblems we had to introduce additional techniques described in Sec For Permutation, ListK and WalkBST our model generalizes very well and achieves low error rates on inuts at least twice longer than the ones seen during the training. The exact generalization errors are shown in Fig. 3. The only hard roblem on which our model discretizes well is Permutation on this task r 3 r 1 r 4 r 2 inc r 1 ' r 3 ' read add r 2 ' min a write Figure 4: The circuit generated at every timeste 2. The values of the ointer () for READ, WRITE and the value to be written (a) for WRITE are resented in Table 2. The modules whose oututs are not used were removed from the icture. r 4 ' 6 Unfortunately we could not test for lengths longer than 50 due to the memory restrictions. 9

10 the discretized version of the model roduces exactly the same oututs as the original model on all cases tested. For the remaining four roblems the discretized version of the models erform very oorly (error rates 70%). We believe that better results may be obtained by using some techniques encouraging discretization during the training 7. We noticed that the training rocedure is very unstable and the error often raises from a few ercents to e.g. 70% in just one eoch. Moreover, even if we use the best found set of hyerarameters, the ercent of random seeds that converges to error 0 was usually equal about 11%. We observed that the ercent of converging seeds is much lower if we do not add noise to the gradient in this case only about 1% of seeds converge. 4.5 COMPARISON TO EXISTING MODELS A comarison to other models is challenging because we are the first to consider roblems with ointers. The NTM can solve tasks like Coy or Reverse, but it suffers from the inability to naturally store a ointer to a fixed location in the memory. This makes it unlikely that it could solve tasks such as ListK, ListSearch or WalkBST since the ointers used in these tasks refer to absolute ositions. What distinguishes our model from most of the revious attemts (including NTMs, Memory Networks, Pointer Networks) is the lack of content-based addressing. It was a deliberate design decision, since this kind of addressing inherently slows down the memory access. In contrast, our model if discretized can access the memory in a constant time. The NRAM is also the first model that we are aware of emloying a differentiable mechanism for deciding when to finish the comutation. 4.6 EXEMPLARY EXECUTION We resent one examle execution of our model for the roblem Coy. For the examle, we use a very small model with 12 memory cells, 4 registers and the standard set of 14 modules. The controller for this model is a feedforward network, and we run it for 11 timestes. Table 2 contains, for every timeste, the state of the memory and registers at the begin of the timeste. The model can execute different circuits at different timestes. In articular, we observed that the first circuit is slightly different from the rest, since it needs to handle the initialization. Starting from the second ste all generated circuits are the same. We resent this circuit in Fig. 4. The register r 2 is constant and kees the offset between the destination array and the source array (6 1 = 5 in this case). The register r 3 is resonsible for incrementing the ointer in the source array. Its value is coied to r 4 8, the register used by the READ module. For the WRITE module, it also uses r 4 which is shifted by r 2. The register r 1 is unused. This solution generalizes to sequences of arbitrary length. 5 CONCLUSIONS In this aer we resented the Neural Random-Access Machine, which can learn to solve roblems that require exlicit maniulation and dereferencing of ointers. We showed that this model can learn to solve a number of algorithmic roblems and generalize well to inuts longer than ones seen during the training. In articular, for some roblems it generalizes to inuts of arbitrary length. However, we noticed that the otimization roblem resulting from the backroagating through the execution trace of the rogram is very challenging for standard otimization techniques. It seems likely that a method that can search in an easier abstract sace would be more effective at solving such roblems. 7 One could for examle add at later stages of training a enalty roortional to the entroy of the intermediate values of registers/memory. 8 In our case r 3 < r 2, so the MIN module always oututs the value r It is not satisfied in the last timeste, but then the array is already coied. 10

11 REFERENCES Bahdanau, Dzmitry, Cho, Kyunghyun, and Bengio, Yoshua. Neural machine translation by jointly learning to align and translate. arxiv rerint arxiv: , Bengio, Yoshua, Simard, Patrice, and Frasconi, Paolo. Learning long-term deendencies with gradient descent is difficult. Neural Networks, IEEE Transactions on, 5(2): , Bengio, Yoshua, Louradour, Jérôme, Collobert, Ronan, and Weston, Jason. Curriculum learning. In Proceedings of the 26th annual international conference on machine learning, ACM, Chan, William, Jaitly, Navdee, Le, Quoc V, and Vinyals, Oriol. Listen, attend and sell. arxiv rerint arxiv: , Graves, Alex, Wayne, Greg, and Danihelka, Ivo. Neural turing machines. arxiv rerint arxiv: , Grefenstette, Edward, Hermann, Karl Moritz, Suleyman, Mustafa, and Blunsom, Phil. Learning to transduce with unbounded memory. arxiv rerint arxiv: , Hochreiter, Se and Schmidhuber, Jürgen. Long short-term memory. Neural comutation, 9(8): , Joulin, Armand and Mikolov, Tomas. Inferring algorithmic atterns with stack-augmented recurrent nets. arxiv rerint arxiv: , Kalchbrenner, Nal, Danihelka, Ivo, and Graves, Alex. Grid long short-term memory. arxiv rerint arxiv: , Kingma, Diederik and Ba, Jimmy. Adam: A method for stochastic otimization. arxiv rerint arxiv: , Luong, Minh-Thang, Pham, Hieu, and Manning, Christoher D. Effective aroaches to attentionbased neural machine translation. arxiv rerint arxiv: , Nair, Vinod and Hinton, Geoffrey E. Rectified linear units imrove restricted boltzmann machines. In Proceedings of the 27th International Conference on Machine Learning (ICML-10), , Neelakantan, Arvind, Vilnis, Luke, Le, Quoc V, Sutskever, Ilya, Kaiser, Lukasz, Kurach, Karol, and Martens, James. Adding gradient noise imroves learning for very dee networks. arxiv rerint arxiv: , Solomonoff, Ray J. A formal theory of inductive inference. art i. Information and control, 7(1): 1 22, Sukhbaatar, Sainbayar, Szlam, Arthur, Weston, Jason, and Fergus, Rob. End-to-end memory networks. arxiv rerint arxiv: , Vinyals, Oriol, Kaiser, Lukasz, Koo, Terry, Petrov, Slav, Sutskever, Ilya, and Hinton, Geoffrey. Grammar as a foreign language. arxiv rerint arxiv: , Vinyals, Oriol, Fortunato, Meire, and Jaitly, Navdee. Pointer networks. arxiv rerint arxiv: , Weston, Jason, Chora, Sumit, and Bordes, Antoine. Memory networks. arxiv rerint arxiv: , Zaremba, Wojciech and Sutskever, Ilya. Learning to execute. arxiv rerint arxiv: , Zaremba, Wojciech and Sutskever, Ilya. Reinforcement learning neural turing machines. arxiv rerint arxiv: ,

12 A DETAILED TASKS DESCRIPTIONS In this section we describe in details the memory layout of inuts and oututs for the tasks used in our exeriments. In all descritions below, big letters reresent arrays and small letters reresents ointers. NULL denotes the value 0 and is used to mark the end of an array or a missing next element in a list or a binary tree. 1. Access Given a value k and an array A, return A[k]. Inut is given as k, A[0],.., A[n 1], NULL and the network should relace the first memory cell with A[k]. 2. Increment Given an array A, increment all its elements by 1. Inut is given as A[0],..., A[n 1], NULL and the exected outut is A[0] + 1,..., A[n 1] Coy Given an array and a ointer to the destination, coy all elements from the array to the given location. Inut is given as, A[0],..., A[n 1] where oints to one element after A[n 1]. The exected outut is A[0],..., A[n 1] at ositions,..., +n 1 resectively. 4. Reverse Given an array and a ointer to the destination, coy all elements from the array in reversed order. Inut is given as, A[0],..., A[n 1] where oints one element after A[n 1]. The exected outut is A[n 1],..., A[0] at ositions,..., +n 1 resectively. 5. Swa Given two ointers, q and an array A, swa elements A[] and A[q]. Inut is given as, q, A[0],.., A[],..., A[q],..., A[n 1], 0. The exected modified array A is: A[0],..., A[q],..., A[],..., A[n 1]. 6. Permutation Given two arrays of n elements: P (contains a ermutation of numbers 0,..., n 1) and A (contains random elements), ermutate A according to P. Inut is given as a, P [0],..., P [n 1], A[0],..., A[n 1], where a is a ointer to the array A. The exected outut is A[P [0]],..., A[P [n 1]], which should override the array P. 7. ListK Given a ointer to the head of a linked list and a number k, find the value of the k-th element on the list. List nodes are reresented as two adjacent memory cells: a ointer to the next node and a value. Elements are in random locations in the memory, so that the network needs to follow the ointers to find the correct element. Inut is given as: head, k, out,... where head is a ointer to the first node on the list, k indicates how many hos are needed and out is a cell where the outut should be ut. 8. ListSearch Given a ointer to the head of a linked list and a value v to find return a ointer to the first node on the list with the value v. The list is laced in memory in the same way as in the task ListK. We fill emty memory with trash values to revent the network from cheating and just iterating over the whole memory. 9. Merge Given ointers to 2 sorted arrays A and B, and the ointer to the outut o, merge the two arrays into one sorted array. The inut is given as: a, b, o, A[0],.., A[n 1], G, B[0],..., B[m 1], G, where G is a secial guardian value, a and b oint to the first elements of arrays A and B resectively, and o oints to the address after the second G. The n + m element should be written in correct order starting from osition o. 10. WalkBST Given a ointer to the root of a Binary Search Tree, and a ath to be traversed, return the element at the end of the ath. The BST nodes are reresented as tries (v, l, r), where v is the value, and l, r are ointers to the left/right child. The triles are laced randomly in the memory. Inut is given as root, out, d 1, d 2,..., d k, NULL,..., where root oints to the root node and out is a slot for the outut. The sequence d 1...d k, d i {0, 1} reresents the ath to be traversed: d i = 0 means that the network should go to the left child, d i = 1 reresents going to the right child. 12

13 B DETAILS OF CURRICULUM TRAINING As noticed in several aers (Bengio et al., 2009; Zaremba & Sutskever, 2014), curriculum learning is crucial for training dee networks on very comlicated roblems. We followed the curriculum learning schedule from Zaremba & Sutskever (2014) without any modifications. For each of the tasks we have manually defined a sequence of subtasks with increasing difficulty, where the difficulty is usually measured by the length of the inut sequence. During training the inut-outut examles are samled from a distribution that is determined by the current difficulty level D. The level is increased (u to some maximal value) whenever the error rate of the model goes below some threshold. Moreover, we ensure that successive increases of D are searated by some number of batches. In more detail, to generate an inut-outut examle we first samle a difficulty d from a distribution determined by the current level D and then draw the examle with the difficulty d. The rocedure for samling d is the following: with robability 10%: ick d uniformly at random from the set of all ossible difficulties; with robability 25%: ick d uniformly from [1, D + e], where e is a samle from a geometric distribution with a success robability 1/2; with robability 65%: set d = D + e, where e is samled as above. Notice that the above rocedure guarantees that every difficulty d can be icked regardless of the current level D, which has been shown to increase erformance Zaremba & Sutskever (2014). 13

14 C EXAMPLE CIRCUITS Below are resented examle circuits generated during training for all simle tasks (excet Coy which was resented in the aer). For modules READ and WRITE, the value of the first argument (ointer to the address to be read/written) is marked as. For WRITE, the value to be written is marked as a and the value returned by this module is always 0. For modules LESS-THAN and LESS-OR-EQUAL-THAN the first arameter is marked as x and the second one as y. Other modules either have only one arameter or the order of arameters is not imortant. For all tasks below (excet Increment), the circuit generated at timeste 1 is different than circuits generated at stes 2, which are the same. This is because the first circuit needs to handle the initialization. We resent only the main circuits generated for timestes 2. C.1 ACCESS 0 y read inc x lt a min write r 2 ' r 1 r 1 ' Figure 5: The circuit generated at every timeste 2 for the task Access. Ste r 1 r Table 3: Memory for task Access. Only the first memory cell is modified. 14

15 C.2 INCREMENT r 3 ' a write r 5 read inc max r 4 ' r 5 ' r 2 ' 1 add min r 1 ' Figure 6: The circuit generated at every timeste for the task Increment. Ste r 1 r 2 r 3 r 4 r Table 4: Memory for task Increment. 15

16 C.3 REVERSE r 4 x r 2 ' 1 y le r 4 ' r 1 ' r 1 add sub r 3 inc r 3 ' dec a write read min max Figure 7: The circuit generated at every timeste 2 for the task Reverse. Ste r 1 r 2 r 3 r Table 5: Memory for task Reverse. 16

17 C.4 SWAP For swa we observed that 2 different circuits are generated, one for even timestes, one for odd timestes. r 2 ' r 1 read read max a max write a add write r 1 ' r 2 Figure 8: The circuit generated at every even timeste for the task Swa. r 1 read max r 1 ' a write a write r 2 eq sub add x y lt 1 2 sub y inc x lt x le r 2 ' y 0 Figure 9: The circuit generated at every odd timeste 3 for the task Swa. Ste r 1 r Table 6: Memory for task Swa. 17

4. Score normalization technical details We now discuss the technical details of the score normalization method.

4. Score normalization technical details We now discuss the technical details of the score normalization method. SMT SCORING SYSTEM This document describes the scoring system for the Stanford Math Tournament We begin by giving an overview of the changes to scoring and a non-technical descrition of the scoring rules