arXiv: v3 [cs.LG] 9 Feb 2016


NEURAL RANDOM-ACCESS MACHINES

Karol Kurach & Marcin Andrychowicz & Ilya Sutskever
Google

ABSTRACT

In this paper, we propose and investigate a new neural network architecture called Neural Random Access Machine. It can manipulate and dereference pointers to an external variable-size random-access memory. The model is trained from pure input-output examples using backpropagation. We evaluate the new model on a number of simple algorithmic tasks whose solutions require pointer manipulation and dereferencing. Our results show that the proposed model can learn to solve algorithmic tasks of such type and is capable of operating on simple data structures like linked-lists and binary trees. For easier tasks, the learned solutions generalize to sequences of arbitrary length. Moreover, memory access during inference can be done in a constant time under some assumptions.

1 INTRODUCTION

Deep learning is successful for two reasons. First, deep neural networks are able to represent the "right" kind of functions; second, deep neural networks are trainable. Deep neural networks can be potentially improved if they get deeper and have fewer parameters, while maintaining trainability. By doing so, we move closer towards a practical implementation of Solomonoff induction (Solomonoff, 1964). The first model that we know of that attempted to train extremely deep networks with a large memory and few parameters is the Neural Turing Machine (NTM) (Graves et al., 2014), a computationally universal deep neural network that is trainable with backpropagation. Other models with this property include variants of Stack-Augmented recurrent neural networks (Joulin & Mikolov, 2015; Grefenstette et al., 2015), and the Grid-LSTM (Kalchbrenner et al., 2015), of which the Grid-LSTM has achieved the greatest success on both synthetic and real tasks.
The key characteristic of these models is that their depth, the size of their short term memory, and their number of parameters are no longer confounded and can be altered independently, which stands in contrast to models like the LSTM (Hochreiter & Schmidhuber, 1997), whose number of parameters grows quadratically with the size of their short term memory.

A fundamental operation of modern computers is pointer manipulation and dereferencing. In this work, we investigate a model class that we name the Neural Random-Access Machine (NRAM), which is a neural network that has, as primitive operations, the ability to manipulate, store in memory, and dereference pointers into its working memory. By providing our model with dereferencing as a primitive, it becomes possible to train models on problems whose solutions require pointer manipulation and chasing. Although all computationally universal neural networks are equivalent, which means that the NRAM model does not have a representational advantage over other models if they are given a sufficient number of computational steps, in practice, the number of timesteps that a given model has is highly limited, as extremely deep models are very difficult to train. As a result, the model's core primitives have a strong effect on the set of functions that can be feasibly learned in practice, similarly to the way in which the choice of a programming language strongly affects the functions that can be implemented with an extremely small amount of code.

Finally, the usefulness of computationally-universal neural networks depends entirely on the ability of backpropagation to find good settings of their parameters. Indeed, it is trivial to define the "optimal" hypothesis class (Solomonoff, 1964), but the problem of finding the best (or even a good) function in that class is intractable. Our work puts the backpropagation algorithm to another test, where the model is extremely deep and intricate.

In our experiments, we evaluate our model on several algorithmic problems whose solutions require pointer manipulation and chasing. These problems include algorithms on a linked-list and a binary tree. While we were able to achieve encouraging results on these problems, we found that standard optimization algorithms struggle with these extremely deep and nonlinear models. We believe that advances in optimization methods will likely lead to better results.

* Equal contribution.

2 RELATED WORK

There has been a significant interest in the problem of learning algorithms in the past few years. The most relevant recent paper is Neural Turing Machines (NTMs) (Graves et al., 2014). It was the first paper to explicitly suggest the notion that it is worth training a computationally universal neural network, and achieved encouraging results.

A follow-up model that had the goal of learning algorithms was the Stack-Augmented Recurrent Neural Network (Joulin & Mikolov, 2015). This work demonstrated that the Stack-Augmented RNN can generalize to long problem instances from short problem instances. A related model is the Reinforcement Learning Neural Turing Machine (Zaremba & Sutskever, 2015), which attempted to use reinforcement learning techniques to train a discrete-continuous hybrid model.

The memory network (Weston et al., 2014) is an early model that attempted to explicitly separate the memory from computation in a neural network model. The follow-up work of Sukhbaatar et al. (2015) combined the memory network with the soft attention mechanism, which allowed it to be trained with less supervision.

The Grid-LSTM (Kalchbrenner et al., 2015) is a highly interesting extension of LSTM, which allows LSTM cells to be used for both deep and sequential computation.
It achieves excellent results on both synthetic, algorithmic problems and on real tasks, such as language modelling, machine translation, and object recognition.

The Pointer Network (Vinyals et al., 2015) is somewhat different from the above models in that it does not have a writable memory; it is more similar to the attention model of Bahdanau et al. (2014) in this regard. Despite not having a memory, this model was able to solve a number of difficult algorithmic problems that include the convex hull and the approximate 2D travelling salesman problem (TSP).

Finally, it is important to mention the attention model of Bahdanau et al. (2014). Although this work is not explicitly aimed at learning algorithms, it is by far the most practical model that has an "algorithmic bent". Indeed, this model has proven to be highly versatile, and variants of this model have achieved state-of-the-art results on machine translation (Luong et al., 2015), speech recognition (Chan et al., 2015), and syntactic parsing (Vinyals et al., 2014), with almost no domain-specific tuning.

3 MODEL

In this section we describe the NRAM model. We start with a description of the simplified version of our model which does not use an external memory and then explain how to augment it with a variable-size random-access memory. The core part of the model is a neural controller, which acts as a "processor". The controller can be a feedforward neural network or an LSTM, and it is the only trainable part of the model.

The model contains $R$ registers, each of which holds an integer value. To make our model trainable with gradient descent, we made it fully differentiable. Hence, each register represents an integer value with a distribution over the set $\{0, 1, \ldots, M-1\}$, for some constant $M$. We do not assume that these distributions have any special form; they are simply stored as vectors $p \in \mathbb{R}^M$ satisfying $p_i \ge 0$ and $\sum_i p_i = 1$. The controller does not have direct access to the registers; it can interact with them using a number of prespecified modules (gates), such as integer addition or equality test.

Let us denote the modules $m_1, m_2, \ldots, m_Q$, where each module is a function

$$m_i : \{0, 1, \ldots, M-1\} \times \{0, 1, \ldots, M-1\} \to \{0, 1, \ldots, M-1\}.$$

On a high level, the model performs a sequence of timesteps, each of which consists of the following substeps:

1. The controller gets some inputs depending on the values of the registers (the controller's inputs are described in Sec. 3.1).
2. The controller updates its internal state (if the controller is an LSTM).
3. The controller outputs the description of a "fuzzy circuit" with inputs $r_1, \ldots, r_R$, gates $m_1, \ldots, m_Q$ and $R$ outputs.
4. The values of the registers are overwritten with the outputs of the circuit.

More precisely, each circuit is created as follows. The inputs for the module $m_i$ are chosen by the controller from the set $\{r_1, \ldots, r_R, o_1, \ldots, o_{i-1}\}$, where:

- $r_j$ is the value stored in the $j$-th register at the current timestep, and
- $o_j$ is the output of the module $m_j$ at the current timestep.

Hence, for each $1 \le i \le Q$ the controller chooses weighted averages of the values $\{r_1, \ldots, r_R, o_1, \ldots, o_{i-1}\}$ which are given as inputs to the module. Therefore,

$$o_i = m_i\Big( (r_1, \ldots, r_R, o_1, \ldots, o_{i-1})^T \mathrm{softmax}(a_i),\; (r_1, \ldots, r_R, o_1, \ldots, o_{i-1})^T \mathrm{softmax}(b_i) \Big), \qquad (1)$$

where the vectors $a_i, b_i \in \mathbb{R}^{R+i-1}$ are produced by the controller (Fig. 1).

[Figure 1: The execution of the module $m_i$. Gates s-m represent the softmax function and $\langle \cdot, \cdot \rangle$ denotes inner product. See Eq. 1 for details.]

Recall that the variables $r_j$ represent probability distributions and therefore the inputs to $m_i$, being weighted averages of probability distributions, are also probability distributions. Thus, as the modules $m_i$ are originally defined for integer inputs and outputs, we must extend their domain to probability distributions as inputs, which can be done in a natural way (and make their output also be a probability distribution):

$$\forall_{0 \le c < M} \quad \mathbb{P}(m_i(A, B) = c) = \sum_{0 \le a, b < M} \mathbb{P}(A = a)\, \mathbb{P}(B = b)\, [m_i(a, b) = c]. \qquad (2)$$

After the modules have produced their outputs, the controller decides which of the values $\{r_1, \ldots, r_R, o_1, \ldots, o_Q\}$ should be stored in the registers. In detail, the controller outputs the vectors $c_i \in \mathbb{R}^{R+Q}$ for $1 \le i \le R$ and the values of the registers are updated (simultaneously) using the formula:

$$r_i := (r_1, \ldots, r_R, o_1, \ldots, o_Q)^T \mathrm{softmax}(c_i). \qquad (3)$$
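Equations 1 to 3 can be illustrated with a minimal sketch in plain Python. It is not the authors' implementation: the helper names (`mix`, `apply_module`, `one_hot`), the tiny value range `M = 4` and the controller logits are assumptions chosen for illustration. `mix` computes the weighted average used in Eqs. 1 and 3, and `apply_module` extends an integer function to distributions as in Eq. 2.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    top = max(logits)
    exps = [math.exp(x - top) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def mix(values, logits):
    """Weighted average (v_1, ..., v_n)^T softmax(logits) of distributions."""
    weights = softmax(logits)
    size = len(values[0])
    return [sum(w * v[c] for w, v in zip(weights, values)) for c in range(size)]

def apply_module(fn, dist_a, dist_b):
    """Eq. 2: push two independent distributions through an integer function."""
    out = [0.0] * len(dist_a)
    for a, prob_a in enumerate(dist_a):
        for b, prob_b in enumerate(dist_b):
            out[fn(a, b)] += prob_a * prob_b
    return out

def one_hot(value, size):
    return [1.0 if i == value else 0.0 for i in range(size)]

M = 4  # assumed size of the value range, for illustration only

def add_mod(a, b):
    return (a + b) % M

r1, r2 = one_hot(1, M), one_hot(2, M)
# Hypothetical controller logits: equal weight on both registers (Eq. 1).
uniform = mix([r1, r2], [0.0, 0.0])
o1 = apply_module(add_mod, uniform, uniform)
# Eq. 3: the new value of r1 is (almost entirely) the module output o1.
r1_new = mix([r1, r2, o1], [0.0, 0.0, 50.0])
```

With equal weights both module inputs are the distribution (0, 0.5, 0.5, 0), so the output of ADD spreads its mass over the sums 2, 3 and 4 mod 4 = 0, which shows how fuzziness in the inputs propagates to the outputs.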

3.1 CONTROLLER'S INPUTS

Recall that at the beginning of each timestep the controller receives some inputs, and it is an important design decision where these inputs should come from. A naive approach is to use the values of the registers as inputs to the controller. However, the values of the registers are probability distributions and are stored as vectors $p \in \mathbb{R}^M$. If the entire distributions were given as inputs to the controller then the number of the model's parameters would depend on $M$. This would be undesirable because, as will be explained in the next section, the value $M$ is linked to the size of an external random-access memory tape and hence it would prevent the model from generalizing to different memory sizes.

Hence, for each $1 \le i \le R$ the controller receives, as input, only one scalar from each register, namely $\mathbb{P}(r_i = 0)$, the probability that the value in the register is equal to 0. This solution has an additional advantage, namely it limits the amount of information available to the controller and forces it to rely on the modules instead of trying to solve the problem on its own. Notice that this information is sufficient to get the exact value of $r_i$ if $r_i \in \{0, 1\}$, which is the case whenever $r_i$ is an output of a "boolean" module, e.g. the inequality test module $m_i(a, b) = [a < b]$.

3.2 MEMORY TAPE

One could use the model described so far for learning sequence-to-sequence transformations by initializing the registers with the input sequence, and training the model to produce the desired output sequence in its registers after a given number of timesteps. The disadvantage of such a model is that it would be completely unable to generalize to longer sequences, because the length of the sequence that the model can process is equal to the number of its registers, which is constant.

Therefore, we extend the model with a variable-size memory tape, which consists of $M$ memory cells, each of which stores a distribution over the set $\{0, 1, \ldots, M-1\}$.
Notice that each distribution stored in a memory cell or a register can be interpreted as a fuzzy address in the memory and used as a fuzzy pointer. We will hence identify integers in the set $\{0, 1, \ldots, M-1\}$ with pointers to the memory. Therefore, the value in each memory cell may be interpreted as an integer or as a pointer. The exact state of the memory can be described by a matrix $\mathcal{M} \in \mathbb{R}^{M \times M}$, where the value $\mathcal{M}_{i,j}$ is the probability that the $i$-th cell holds the value $j$.

The model interacts with the memory tape solely using two special modules:

- READ module: this module takes as the input a pointer [1] and returns the value stored under the given address in the memory. This operation is extended to fuzzy pointers similarly to Eq. 2. More precisely, if $p$ is a vector representing the probability distribution of the input (i.e. $p_i$ is the probability that the input pointer points to the $i$-th cell) then the module returns the value $\mathcal{M}^T p$.
- WRITE module: this module takes as the input a pointer $p$ and a value $a$ and stores the value $a$ under the address $p$ in the memory. The fuzzy form of the operation can be effectively expressed using matrix operations [2].

The full architecture of the NRAM model is presented in Fig. 2.

3.3 INPUTS AND OUTPUTS HANDLING

The memory tape also serves as an input-output channel: the model's memory is initialized with the input sequence and the model is expected to produce the output in the memory. Moreover, we use a novel way of deciding how many timesteps should be executed. After each timestep we let the controller decide whether it would like to continue the execution or finish it, in which case the current state of the memory is treated as the output.

[1] Formally each module takes two arguments. In this case the second argument is simply ignored.
[2] The exact formula is $\mathcal{M} := (J - p)J^T \cdot \mathcal{M} + p a^T$, where $J$ denotes a (column) vector consisting of $M$ ones and $\cdot$ denotes coordinate-wise multiplication.
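The READ and WRITE modules of Sec. 3.2 can be sketched directly from their matrix forms. The code below is an illustrative sketch in plain Python, not the authors' code: the memory is a list of rows (row $i$ is the distribution stored in cell $i$), and the function names and the three-cell example are assumptions. `read` computes $\mathcal{M}^T p$; `write` applies the footnote formula row by row, i.e. row $i$ becomes $(1 - p_i)$ times the old row plus $p_i$ times $a$.

```python
def one_hot(value, size):
    return [1.0 if i == value else 0.0 for i in range(size)]

def read(memory, pointer):
    """Fuzzy READ: returns M^T p, a mixture of the cell contents."""
    size = len(memory)
    return [sum(pointer[i] * memory[i][j] for i in range(size))
            for j in range(size)]

def write(memory, pointer, value):
    """Fuzzy WRITE: cell i keeps (1 - p_i) of its content and takes p_i of a."""
    size = len(memory)
    return [[(1.0 - pointer[i]) * memory[i][j] + pointer[i] * value[j]
             for j in range(size)]
            for i in range(size)]

# Three cells holding the integers 2, 0 and 1 (as one-hot distributions).
memory = [one_hot(2, 3), one_hot(0, 3), one_hot(1, 3)]

sharp = read(memory, one_hot(0, 3))          # exact dereference of cell 0
fuzzy = read(memory, [0.5, 0.5, 0.0])        # pointer split between cells 0 and 1
updated = write(memory, [0.5, 0.0, 0.5], one_hot(0, 3))
```

A one-hot pointer reproduces an ordinary array access, while a fuzzy pointer returns a mixture of cells. Every row of the memory remains a probability distribution after a write, because each row is a convex combination of two distributions.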

[Figure 2: One timestep of the NRAM architecture with $R = 4$ registers. The LSTM controller gets the "binarized" values $r_1, r_2, \ldots$ stored in the registers as inputs and outputs the description of the circuit in the grey box and the probability of finishing the execution in the current timestep (see Sec. 3.3 for more detail). The weights of the solid thin connections are outputted by the controller. The weights of the solid thick connections are trainable parameters of the model. Some of the modules (i.e. READ and WRITE) may interact with the memory tape (dashed connections).]

More precisely, after the timestep $t$ the controller outputs a scalar $f_t \in [0, 1]$ [3], which denotes the willingness to finish the execution in the current timestep. Therefore, the probability that the execution has not been finished before the timestep $t$ is equal to $\prod_{i=1}^{t-1} (1 - f_i)$, and the probability that the output is produced exactly at the timestep $t$ is equal to $p_t = f_t \cdot \prod_{i=1}^{t-1} (1 - f_i)$. There is also some maximal allowed number of timesteps $T$, which is a hyperparameter. The model is forced to produce output in the last step if it has not done it yet, i.e. $p_T = 1 - \sum_{i=1}^{T-1} p_i$ regardless of the value $f_T$.

Let $\mathcal{M}^{(t)} \in \mathbb{R}^{M \times M}$ denote the memory matrix after the timestep $t$, i.e. $\mathcal{M}^{(t)}_{i,j}$ is the probability that the $i$-th memory cell holds the value $j$ after the timestep $t$. For an input-output pair $(x, y)$, where $x, y \in \{0, 1, \ldots, M-1\}^M$, we define the loss of the model as the expected negative log-likelihood of producing the correct output, i.e.,

$$-\sum_{t=1}^{T} \left( p_t \cdot \sum_{i=1}^{M} \log\big(\mathcal{M}^{(t)}_{i, y_i}\big) \right),$$

assuming that the memory was initialized with the sequence $x$ [4]. Moreover, for all problems we consider the output sequence is shorter than the memory. Therefore, we compute the loss only over memory cells, which should contain the output.

3.4 DISCRETIZATION

Computing the outputs of modules, represented as probability distributions, is a computationally costly operation.
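The stopping probabilities and the loss of Sec. 3.3 can be sketched as follows. This is an illustrative sketch in plain Python under stated assumptions: the names `stop_probabilities` and `expected_nll` are hypothetical, the target is passed as a dictionary from cell index to correct value (only the cells that should contain the output), and the logarithm is clamped by a small `eps` in the way Sec. 4.1 describes for limiting the values of logarithms.

```python
import math

def stop_probabilities(finish):
    """p_t = f_t * prod_{i<t} (1 - f_i); the last step absorbs the remaining mass."""
    probs = []
    still_running = 1.0
    for f_t in finish[:-1]:
        probs.append(f_t * still_running)
        still_running *= 1.0 - f_t
    probs.append(1.0 - sum(probs))  # p_T, regardless of the value f_T
    return probs

def expected_nll(memories, stop_probs, target, eps=1e-12):
    """-sum_t p_t * sum_i log M^(t)[i][y_i], restricted to the cells in target."""
    loss = 0.0
    for memory, p_t in zip(memories, stop_probs):
        log_lik = sum(math.log(max(memory[i][y], eps)) for i, y in target.items())
        loss -= p_t * log_lik
    return loss

# Willingness to finish after each of T = 3 timesteps (assumed values).
p = stop_probabilities([0.5, 0.5, 0.9])

# The same two-cell memory after every timestep, for simplicity.
memory = [[0.5, 0.5], [0.0, 1.0]]
loss = expected_nll([memory] * 3, p, target={0: 1, 1: 1})
```

Because the memory is identical at every timestep in this toy case, the loss reduces to $-\log 0.5$ whatever the stopping distribution is; in general each timestep's log-likelihood is weighted by the probability of stopping there.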
For example, computing the output of the READ module takes $\Theta(M^2)$ time as it requires the multiplication of the matrix $\mathcal{M} \in \mathbb{R}^{M \times M}$ and the vector $p \in \mathbb{R}^M$.

One may however suspect (and we empirically verify this claim in Sec. 4) that the NRAM model naturally learns solutions in which the distributions of intermediate values have very low entropy. The argument for this hypothesis is that fuzziness in the intermediate values would probably propagate to the output and cause a higher value of the cost function. To test this hypothesis we trained the model and then used its discretized version during inference. In the discretized version every module gets as inputs the values from modules (or registers), which are the most probable to produce the given input according to the distribution outputted by the controller. More precisely, it corresponds to replacing the function softmax in equations (1, 3) with the function returning the vector containing 1 on the position of the maximum value in the input and zeros on all other positions.

Notice that in the discretized NRAM model each register and memory cell stores an integer from the set $\{0, 1, \ldots, M-1\}$ and therefore all modules may be executed efficiently (assuming that the functions represented by the modules can be efficiently computed). In case of a feedforward controller and a small (e.g. $\le 20$) number of registers the inference can be accelerated even further. Recall that the only inputs to the controller are binarized values of the registers. Therefore, instead of executing the controller one may simply precompute the (discretized) controller's output for each configuration of the registers' binarized values. Such an algorithm would enjoy an extremely efficient implementation in machine code.

[3] In fact, the controller outputs a scalar $x_i$ and $f_i = \mathrm{sigmoid}(x_i)$.
[4] One could also use the negative log-likelihood of the expected output, i.e. $-\sum_{i=1}^{M} \log\left( \sum_{t=1}^{T} p_t \cdot \mathcal{M}^{(t)}_{i, y_i} \right)$ as the loss function.

4 EXPERIMENTS

4.1 TRAINING PROCEDURE

The NRAM model is fully differentiable and we trained it using the Adam optimization algorithm (Kingma & Ba, 2014) with the negative log-likelihood cost function. Notice that we do not use any additional supervised data (such as memory access traces) beyond pure input-output examples.

We used multilayer perceptrons (MLPs) with two hidden layers or LSTMs with a hidden layer between input and LSTM cells as controllers. The number of hidden units in each layer was equal. The ReLu nonlinearity (Nair & Hinton, 2010) was used in all experiments.

Below are some important techniques that we used in the training:

Curriculum learning. As noticed in several papers (Bengio et al., 2009; Zaremba & Sutskever, 2014), curriculum learning is crucial for training deep networks on very complicated problems. We followed the curriculum learning schedule from Zaremba & Sutskever (2014) without any modifications. The details can be found in Appendix B.
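Returning to the discretization of Sec. 3.4, the two ideas there (replacing softmax by a one-hot arg-max, and tabulating a feedforward controller over all binarized register configurations) can be sketched as below. This is a hedged illustration, not the authors' code: `hard_select`, `precompute_controller` and especially `toy_controller` are hypothetical names, and the toy controller is merely a stand-in for a trained network that maps binarized register values to logit vectors.

```python
from itertools import product

def hard_select(logits):
    """Discretized replacement for softmax: one-hot at the arg-max position."""
    best = max(range(len(logits)), key=lambda k: logits[k])
    return [1.0 if k == best else 0.0 for k in range(len(logits))]

def precompute_controller(controller, num_registers):
    """Tabulate the discretized controller output for all 2^R binarized inputs."""
    table = {}
    for bits in product((0, 1), repeat=num_registers):
        table[bits] = [hard_select(logits) for logits in controller(bits)]
    return table

def toy_controller(bits):
    """Hypothetical stand-in for a trained feedforward controller.

    Takes the binarized register values and returns one logit vector per
    selection the circuit needs (here: one per register, three options each).
    """
    return [[float(b), 0.5, 0.0] for b in bits]

table = precompute_controller(toy_controller, num_registers=3)
selection = table[(1, 0, 1)]  # a dictionary lookup replaces a forward pass
```

The table has $2^R$ entries, which is why the approach is only practical for a small number of registers, as the text notes.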
Gradient clipping. Notice that the depth of the unfolded execution is roughly a product of the number of timesteps and the number of modules. Even for moderately small experiments (e.g. 14 modules and 20 timesteps) this value easily exceeds a few hundred. In networks of such depth, the gradients can often explode (Bengio et al., 1994), which makes training by backpropagation much harder. We noticed that the gradients w.r.t. the intermediate values inside the backpropagation were so large that they sometimes led to an overflow in single-precision floating-point arithmetic. Therefore, we clipped the gradients w.r.t. the activations, within the execution of the backpropagation algorithm. More precisely, each coordinate is separately cropped into the range $[-C_1, C_1]$ for some constant $C_1$. Before updating parameters, we also globally rescale the whole gradient vector, so that its L2 norm is not bigger than some constant value $C_2$.

Noise. We added random Gaussian noise to the computed gradients after the backpropagation step. The variance of this noise decays exponentially during the training. The details can be found in Neelakantan et al. (2015).

Enforcing Distribution Constraints. For very deep networks, a small error in one place can propagate to a huge error in some other place. This was the case with our pointers: they are probability distributions over memory cells and they should sum up to 1. However, after a number of operations are applied, they can accumulate error as a result of inaccurate floating-point arithmetic. We have a special layer which is responsible for rescaling all values (multiplying by the inverse of their sum), to make sure they always represent a probability distribution. We add this layer to our model in a few critical places (e.g. after the softmax operation) [5].

[5] We do not however backpropagate through these renormalizing operations, i.e. during the backward pass we simply assume that they are identities.
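The gradient clipping, gradient noise and renormalization steps described above can be sketched on plain lists of floats. This is a minimal illustration under stated assumptions, not the authors' training code: the function names are hypothetical, and the noise schedule (`base_variance * decay ** step`) is only one possible exponential decay, since the exact schedule is deferred to Neelakantan et al. (2015).

```python
import math
import random

def clip_coordinates(grad, c1):
    """Crop every coordinate separately into the range [-C1, C1]."""
    return [max(-c1, min(c1, g)) for g in grad]

def rescale_to_norm(grad, c2):
    """Globally rescale the vector so that its L2 norm is at most C2."""
    norm = math.sqrt(sum(g * g for g in grad))
    if norm <= c2:
        return list(grad)
    return [g * c2 / norm for g in grad]

def add_decaying_noise(grad, step, base_variance=1.0, decay=0.999, rng=random):
    """Add Gaussian noise whose variance decays exponentially with the step.

    The schedule base_variance * decay ** step is an assumption made here
    for illustration only.
    """
    std = math.sqrt(base_variance * decay ** step)
    return [g + rng.gauss(0.0, std) for g in grad]

def renormalize(dist):
    """Rescale non-negative values so that they sum to 1 again."""
    total = sum(dist)
    return [x / total for x in dist]

clipped = clip_coordinates([5.0, -7.0, 0.5], c1=1.0)
rescaled = rescale_to_norm([3.0, 4.0], c2=1.0)
noisy = add_decaying_noise([1.0, 2.0], step=10, rng=random.Random(0))
fixed = renormalize([0.2, 0.2, 0.1])
```

Note that in the text the coordinate-wise clipping is applied to gradients with respect to activations inside backpropagation, whereas the global rescaling is applied to the parameter gradient before the update; the sketch only shows the arithmetic of each operation.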

Entropy. While searching for a solution, the network can fix the pointer distribution on some particular value. This is advantageous at the end of training, because ideally we would like to be able to discretize the model. However, if this happens at the beginning of the training, it could force the network to stay in a local minimum, with a small chance of moving the probability mass to some other value. To address this problem, we encourage the network to explore the space of solutions by adding an "entropy bonus" that decreases over time. More precisely, for every distribution outputted by the controller, we subtract from the cost function the entropy of the distribution multiplied by some coefficient, which decreases exponentially during the training.

Limiting the values of logarithms. There are two places in our model where the logarithms are computed: in the cost function and in the entropy computation. Inputs to those logarithms can be very small numbers, which may cause very big values of the cost function or even overflows in floating-point arithmetic. To prevent this phenomenon we use $\log(\max(x, \epsilon))$ instead of $\log(x)$ for some small hyperparameter $\epsilon$ whenever a logarithm is computed.

4.2 TASKS

We now describe the tasks used in our experiments. For every task, the input is given to the network in the memory tape, and the network's goal is to modify the memory according to the task's specification. We allow the network to modify the original input. The final error for a test case is computed as $c/m$, where $c$ is the number of correctly written cells, and $m$ represents the total number of cells that should be modified.

Due to limited space, we describe the tasks only briefly here. The detailed memory layout of inputs and outputs can be found in Appendix A.

1. Access: Given a value k and an array A, return A[k].
2. Increment: Given an array, increment all its elements by 1.
3. Copy: Given an array and a pointer to the destination, copy all elements from the array to the given location.
4. Reverse: Given an array and a pointer to the destination, copy all elements from the array in reversed order.
5. Swap: Given two pointers p, q and an array A, swap elements A[p] and A[q].
6. Permutation: Given two arrays of n elements: P (contains a permutation of numbers 1, ..., n) and A (contains random elements), permute A according to P.
7. ListK: Given a pointer to the head of a linked list and a number k, find the value of the k-th element on the list.
8. ListSearch: Given a pointer to the head of a linked list and a value v to find, return a pointer to the first node on the list with the value v.
9. Merge: Given pointers to 2 sorted arrays A and B, merge them.
10. WalkBST: Given a pointer to the root of a Binary Search Tree, and a path to be traversed (sequence of left/right steps), return the element at the end of the path.

4.3 MODULES

In all of our experiments we used the same sequence of 14 modules: READ (described in Sec. 3.2), ZERO(a, b) = 0, ONE(a, b) = 1, TWO(a, b) = 2, INC(a, b) = (a + 1) mod M, ADD(a, b) = (a + b) mod M, SUB(a, b) = (a − b) mod M, DEC(a, b) = (a − 1) mod M, LESS-THAN(a, b) = [a < b], LESS-OR-EQUAL-THAN(a, b) = [a ≤ b], EQUALITY-TEST(a, b) = [a = b], MIN(a, b) = min(a, b), MAX(a, b) = max(a, b), WRITE (described in Sec. 3.2).

We also considered settings in which the module sequence is repeated many times, e.g. there are 28 modules, where modules number 1 and 15 are READ, modules number 2 and 16 are ZERO and so on. The number of repetitions is a hyperparameter.
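The twelve arithmetic and comparison modules listed in Sec. 4.3 can be written down as ordinary integer functions; READ and WRITE are left out because they need access to the memory tape. The sketch below is illustrative: the function name `make_modules` and the choice `M = 8` in the usage lines are assumptions, and boolean results are returned as the integers 0 and 1 so that every output stays in the set {0, ..., M − 1}.

```python
def make_modules(M):
    """Integer versions of the non-memory modules, in the order of Sec. 4.3."""
    return [
        ("ZERO", lambda a, b: 0),
        ("ONE", lambda a, b: 1),
        ("TWO", lambda a, b: 2),
        ("INC", lambda a, b: (a + 1) % M),
        ("ADD", lambda a, b: (a + b) % M),
        ("SUB", lambda a, b: (a - b) % M),
        ("DEC", lambda a, b: (a - 1) % M),
        ("LESS-THAN", lambda a, b: int(a < b)),
        ("LESS-OR-EQUAL-THAN", lambda a, b: int(a <= b)),
        ("EQUALITY-TEST", lambda a, b: int(a == b)),
        ("MIN", lambda a, b: min(a, b)),
        ("MAX", lambda a, b: max(a, b)),
    ]

modules = dict(make_modules(M=8))  # M = 8 is an arbitrary illustrative size

wrapped_sum = modules["ADD"](5, 6)     # (5 + 6) mod 8
wrapped_dec = modules["DEC"](0, 3)     # second argument is ignored
comparison = modules["LESS-THAN"](2, 5)
```

Python's `%` operator already returns a non-negative result for a positive modulus, so SUB and DEC wrap around as the definitions require.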

Task          Train Complexity        Train error   Generalization   Discretization
------------------------------------------------------------------------------------
Access        len(A) ≤ 20             0             perfect          perfect
Increment     len(A) ≤ 15             0             perfect          perfect
Copy          len(A) ≤ 15             0             perfect          perfect
Reverse       len(A) ≤ 15             0             perfect          perfect
Swap          len(A) ≤ 20             0             perfect          perfect
Permutation   len(A) ≤ 6              0             almost perfect   perfect
ListK         len(list) ≤ 10          0             strong           hurts performance
ListSearch    len(list) ≤ 6           0             weak             hurts performance
Merge         len(A) + len(B) ≤ 10    1%            weak             hurts performance
WalkBST       size(tree) ≤ 10         0.3%          strong           hurts performance

Table 1: Results of the experiments. The perfect generalization error means that the tested problem had error 0 for complexity up to 50. Exact generalization errors are presented in Fig. 3. The perfect discretization means that the discretized version of the model produced exactly the same outputs as the original model on all test cases.

[Figure 3: plot of test error (vertical axis) against maximal task complexity (horizontal axis) for Merge, WalkBST, ListK, ListSearch and Permutation.]

Figure 3: Generalization errors for hard tasks. The Permutation and ListSearch problems were trained only up to complexity 6. The remaining problems were trained up to complexity 10. The horizontal axis denotes the maximal task complexity, i.e., x = 20 denotes results with complexity sampled uniformly from the interval [1, 20].

4.4 RESULTS

Overall, we were able to find parameters that achieved an error 0 for all problems except Merge and WalkBST (where we got an error of at most 1%). As described in Sec. 4.2, our metric is an accuracy on the memory cells that should be modified. To compute it, we take the continuous memory state produced by our network, then discretize it (every cell will contain the value with the highest probability), and finally compare with the expected output. The results of the experiments are summarized in Table 1.

Below we describe our results on all 10 tasks in more detail. We divide them into 2 categories: "easy" and "hard" tasks. Easy tasks is a category of tasks that achieved low error scores for many sets of parameters and we did not have to spend much time trying to tune them.
The first 5 problems from our task list belong to this category. Hard tasks, on the other hand, are problems that often trained to low error rate only in a very small number of cases, e.g. 1 out of [...].

4.4.1 EASY TASKS

This category includes the following problems: Access, Increment, Copy, Reverse, Swap. For all of them we were able to find many sets of hyperparameters that achieved error 0, or close to it, without much effort.
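The evaluation metric described in Sec. 4.4 (discretize the continuous memory, then compare only the cells that should be modified) can be sketched as follows. The names `discretize_memory` and `fraction_correct` and the small example memory are assumptions made for illustration; the function returns the fraction $c/m$ of correctly written cells from Sec. 4.2.

```python
def discretize_memory(memory):
    """Replace every cell's distribution by its most probable value."""
    return [max(range(len(row)), key=lambda j: row[j]) for row in memory]

def fraction_correct(memory, expected):
    """c / m over the cells listed in `expected` (cell index -> correct value)."""
    values = discretize_memory(memory)
    correct = sum(1 for cell, value in expected.items() if values[cell] == value)
    return correct / len(expected)

# A 3-cell memory; only cells 1 and 2 are supposed to hold the output.
memory = [
    [0.1, 0.8, 0.1],
    [0.7, 0.2, 0.1],
    [0.3, 0.3, 0.4],
]
values = discretize_memory(memory)
score = fraction_correct(memory, expected={1: 0, 2: 1})
```

In this toy case cell 1 is correct and cell 2 is not, so half of the cells that should be modified are written correctly.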

Step   READ   WRITE
-----------------------
1      p:0    p:0   a:
2      p:1    p:6   a:
3      p:1    p:6   a:
4      p:2    p:7   a:
5      p:2    p:7   a:
6      p:3    p:8   a:
7      p:3    p:8   a:
8      p:4    p:9   a:
9      p:4    p:9   a:
10     p:5    p:10  a:
11     p:5    p:10  a:9

[The memory-cell and register (r_1 to r_4) columns of this table, and most of the written values, are not recoverable here.]

Table 2: State of memory and registers for the Copy problem at the start of every timestep. We also show the arguments given to the READ and WRITE functions in each timestep. The argument "p:" represents the source/destination address and "a:" represents the value to be written (for WRITE). The value 6 at position 0 in the memory is the pointer to the destination array. It is followed by 5 values (gray columns) that should be copied.

We also tested how those solutions generalize to longer input sequences. To do this, for every problem we selected a model that achieved error 0 during the training, and tested it on inputs with lengths up to 50 [6]. To perform these tests we also increased the memory size and the number of allowed timesteps. In all cases the model solved the problem perfectly, which shows that it generalizes not only to longer input sequences, but also to different memory sizes and numbers of allowed timesteps. Moreover, the discretized version of the model (see Sec. 3.4 for details) also solves all the problems perfectly. These results show that the NRAM model naturally learns "algorithmic" solutions, which generalize well.

We were also interested in whether the found solutions generalize to sequences of arbitrary length. This is easiest to verify in the case of a discretized model with a feedforward controller, because then the circuits outputted by the controller depend solely on the values of registers, which are integers. We manually analysed circuits for problems Copy and Increment and verified that the found solutions generalize to inputs of arbitrary length, assuming that the number of allowed timesteps is appropriate.

4.4.2 HARD TASKS

This category includes: Permutation, ListK, ListSearch, Merge and WalkBST. For all of them we had to perform an extensive random search to find a good set of hyperparameters.
Usually, most of the parameter combinations were stuck on the starting curriculum level with a high error of 50%–70%. For the first 3 tasks we managed to train the network to achieve error 0. For WalkBST and Merge the training errors were 0.3% and 1% respectively. For training those problems we had to introduce additional techniques described in Sec. 4.1.

For Permutation, ListK and WalkBST our model generalizes very well and achieves low error rates on inputs at least twice as long as the ones seen during the training. The exact generalization errors are shown in Fig. 3.

The only hard problem on which our model discretizes well is Permutation: on this task the discretized version of the model produces exactly the same outputs as the original model on all cases tested. For the remaining four problems the discretized versions of the models perform very poorly (error rates 70%). We believe that better results may be obtained by using some techniques encouraging discretization during the training [7].

[Figure 4: circuit diagram with inputs r_1, ..., r_4, the modules INC, READ, ADD, MIN and WRITE, and outputs r_1', ..., r_4'.]

Figure 4: The circuit generated at every timestep ≥ 2. The values of the pointer (p) for READ, WRITE and the value to be written (a) for WRITE are presented in Table 2. The modules whose outputs are not used were removed from the picture.

We noticed that the training procedure is very unstable and the error often rises from a few percent to e.g. 70% in just one epoch. Moreover, even if we use the best found set of hyperparameters, the percentage of random seeds that converge to error 0 was usually about 11%. We observed that the percentage of converging seeds is much lower if we do not add noise to the gradient; in this case only about 1% of seeds converge.

[6] Unfortunately we could not test for lengths longer than 50 due to the memory restrictions.

4.5 COMPARISON TO EXISTING MODELS

A comparison to other models is challenging because we are the first to consider problems with pointers. The NTM can solve tasks like Copy or Reverse, but it suffers from the inability to naturally store a pointer to a fixed location in the memory. This makes it unlikely that it could solve tasks such as ListK, ListSearch or WalkBST since the pointers used in these tasks refer to absolute positions.

What distinguishes our model from most of the previous attempts (including NTMs, Memory Networks, Pointer Networks) is the lack of content-based addressing. It was a deliberate design decision, since this kind of addressing inherently slows down the memory access. In contrast, our model, if discretized, can access the memory in a constant time.

The NRAM is also the first model that we are aware of employing a differentiable mechanism for deciding when to finish the computation.

4.6 EXEMPLARY EXECUTION

We present one example execution of our model for the problem Copy. For the example, we use a very small model with 12 memory cells, 4 registers and the standard set of 14 modules. The controller for this model is a feedforward network, and we run it for 11 timesteps.
Table 2 contains, for every timestep, the state of the memory and registers at the beginning of the timestep.

The model can execute different circuits at different timesteps. In particular, we observed that the first circuit is slightly different from the rest, since it needs to handle the initialization. Starting from the second step all generated circuits are the same. We present this circuit in Fig. 4. The register r_2 is constant and keeps the offset between the destination array and the source array (6 − 1 = 5 in this case). The register r_3 is responsible for incrementing the pointer in the source array. Its value is copied to r_4 [8], the register used by the READ module. For the WRITE module, it also uses r_4 which is shifted by r_2. The register r_1 is unused. This solution generalizes to sequences of arbitrary length.

5 CONCLUSIONS

In this paper we presented the Neural Random-Access Machine, which can learn to solve problems that require explicit manipulation and dereferencing of pointers.

We showed that this model can learn to solve a number of algorithmic problems and generalize well to inputs longer than ones seen during the training. In particular, for some problems it generalizes to inputs of arbitrary length.

However, we noticed that the optimization problem resulting from backpropagating through the execution trace of the program is very challenging for standard optimization techniques. It seems likely that a method that can search in an easier "abstract" space would be more effective at solving such problems.

[7] One could for example add at later stages of training a penalty proportional to the entropy of the intermediate values of registers/memory.
[8] In our case r_3 < r_2, so the MIN module always outputs the value r [...]. It is not satisfied in the last timestep, but then the array is already copied.

REFERENCES

Bahdanau, Dzmitry, Cho, Kyunghyun, and Bengio, Yoshua. Neural machine translation by jointly learning to align and translate. arXiv preprint.
Bengio, Yoshua, Simard, Patrice, and Frasconi, Paolo. Learning long-term dependencies with gradient descent is difficult. Neural Networks, IEEE Transactions on, 5(2).
Bengio, Yoshua, Louradour, Jérôme, Collobert, Ronan, and Weston, Jason. Curriculum learning. In Proceedings of the 26th Annual International Conference on Machine Learning. ACM.
Chan, William, Jaitly, Navdeep, Le, Quoc V, and Vinyals, Oriol. Listen, attend and spell. arXiv preprint.
Graves, Alex, Wayne, Greg, and Danihelka, Ivo. Neural Turing machines. arXiv preprint.
Grefenstette, Edward, Hermann, Karl Moritz, Suleyman, Mustafa, and Blunsom, Phil. Learning to transduce with unbounded memory. arXiv preprint.
Hochreiter, Sepp and Schmidhuber, Jürgen. Long short-term memory. Neural Computation, 9(8).
Joulin, Armand and Mikolov, Tomas. Inferring algorithmic patterns with stack-augmented recurrent nets. arXiv preprint.
Kalchbrenner, Nal, Danihelka, Ivo, and Graves, Alex. Grid long short-term memory. arXiv preprint.
Kingma, Diederik and Ba, Jimmy. Adam: A method for stochastic optimization. arXiv preprint.
Luong, Minh-Thang, Pham, Hieu, and Manning, Christopher D. Effective approaches to attention-based neural machine translation. arXiv preprint.
Nair, Vinod and Hinton, Geoffrey E. Rectified linear units improve restricted Boltzmann machines. In Proceedings of the 27th International Conference on Machine Learning (ICML-10).
Neelakantan, Arvind, Vilnis, Luke, Le, Quoc V, Sutskever, Ilya, Kaiser, Lukasz, Kurach, Karol, and Martens, James. Adding gradient noise improves learning for very deep networks. arXiv preprint.
Solomonoff, Ray J. A formal theory of inductive inference. Part I. Information and Control, 7(1):1-22.
Sukhbaatar, Sainbayar, Szlam, Arthur, Weston, Jason, and Fergus, Rob. End-to-end memory networks. arXiv preprint.
Vinyals, Oriol, Kaiser, Lukasz, Koo, Terry, Petrov, Slav, Sutskever, Ilya, and Hinton, Geoffrey. Grammar as a foreign language. arXiv preprint.
Vinyals, Oriol, Fortunato, Meire, and Jaitly, Navdeep. Pointer networks. arXiv preprint.
Weston, Jason, Chopra, Sumit, and Bordes, Antoine. Memory networks. arXiv preprint.
Zaremba, Wojciech and Sutskever, Ilya. Learning to execute. arXiv preprint.
Zaremba, Wojciech and Sutskever, Ilya. Reinforcement learning neural Turing machines. arXiv preprint.

A DETAILED TASK DESCRIPTIONS

In this section we describe in detail the memory layout of inputs and outputs for the tasks used in our experiments. In all descriptions below, capital letters represent arrays and lowercase letters represent pointers. NULL denotes the value 0 and is used to mark the end of an array or a missing next element in a list or a binary tree.

1. Access. Given a value k and an array A, return A[k]. Input is given as k, A[0], ..., A[n-1], NULL and the network should replace the first memory cell with A[k].

2. Increment. Given an array A, increment all its elements by 1. Input is given as A[0], ..., A[n-1], NULL and the expected output is A[0] + 1, ..., A[n-1] + 1.

3. Copy. Given an array and a pointer to the destination, copy all elements from the array to the given location. Input is given as p, A[0], ..., A[n-1] where p points to one element after A[n-1]. The expected output is A[0], ..., A[n-1] at positions p, ..., p+n-1 respectively.

4. Reverse. Given an array and a pointer to the destination, copy all elements from the array in reversed order. Input is given as p, A[0], ..., A[n-1] where p points to one element after A[n-1]. The expected output is A[n-1], ..., A[0] at positions p, ..., p+n-1 respectively.

5. Swap. Given two pointers p, q and an array A, swap elements A[p] and A[q]. Input is given as p, q, A[0], ..., A[p], ..., A[q], ..., A[n-1], 0. The expected modified array A is: A[0], ..., A[q], ..., A[p], ..., A[n-1].

6. Permutation. Given two arrays of n elements: P (contains a permutation of numbers 0, ..., n-1) and A (contains random elements), permute A according to P. Input is given as a, P[0], ..., P[n-1], A[0], ..., A[n-1], where a is a pointer to the array A. The expected output is A[P[0]], ..., A[P[n-1]], which should overwrite the array P.

7. ListK. Given a pointer to the head of a linked list and a number k, find the value of the k-th element on the list. List nodes are represented as two adjacent memory cells: a pointer to the next node and a value. Elements are in random locations in the memory, so that the network needs to follow the pointers to find the correct element. Input is given as: head, k, out, ... where head is a pointer to the first node on the list, k indicates how many hops are needed and out is a cell where the output should be put.

8. ListSearch. Given a pointer to the head of a linked list and a value v to find, return a pointer to the first node on the list with the value v. The list is placed in memory in the same way as in the task ListK. We fill empty memory with trash values to prevent the network from cheating and just iterating over the whole memory.

9. Merge. Given pointers to 2 sorted arrays A and B, and the pointer to the output o, merge the two arrays into one sorted array. The input is given as: a, b, o, A[0], ..., A[n-1], G, B[0], ..., B[m-1], G, where G is a special guardian value, a and b point to the first elements of arrays A and B respectively, and o points to the address after the second G. The n + m elements should be written in the correct order starting from position o.

10. WalkBST. Given a pointer to the root of a Binary Search Tree, and a path to be traversed, return the element at the end of the path. The BST nodes are represented as triples (v, l, r), where v is the value, and l, r are pointers to the left/right child. The triples are placed randomly in the memory. Input is given as root, out, d_1, d_2, ..., d_k, NULL, ..., where root points to the root node and out is a slot for the output. The sequence d_1, ..., d_k, with d_i ∈ {0, 1}, represents the path to be traversed: d_i = 0 means that the network should go to the left child, d_i = 1 represents going to the right child.
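To make one of these memory layouts concrete, the sketch below solves ListK with ordinary pointer chasing over a flat list of integers. It is a hand-written reference under assumptions, not the learned model: the function name `list_k` and the example memory are hypothetical, "out" is read here as the third memory cell itself (index 2), and k is taken as the number of hops from the head, so k = 0 returns the value of the first node.

```python
def list_k(memory):
    """Reference solution for the ListK layout (illustrative only).

    Assumed layout: memory[0] = head pointer, memory[1] = k,
    memory[2] = output cell; each list node occupies two adjacent
    cells (pointer to the next node, value).
    """
    mem = list(memory)
    node, hops = mem[0], mem[1]
    for _ in range(hops):
        node = mem[node]          # dereference the "next" pointer
    mem[2] = mem[node + 1]        # the value sits right after the pointer
    return mem


# Hypothetical 12-cell memory with nodes at addresses 3, 9 and 5:
#   node@3 = (next=9, value=17), node@9 = (next=5, value=23),
#   node@5 = (next=NULL, value=42)
example = [3, 2, 0, 9, 17, 0, 42, 0, 0, 5, 23, 0]
print(list_k(example)[2])
# prints 42
```

Because the nodes are scattered, the answer cannot be found by scanning cells in order; each hop requires reading an address from memory and then reading at that address.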

B DETAILS OF CURRICULUM TRAINING

As noticed in several papers (Bengio et al., 2009; Zaremba & Sutskever, 2014), curriculum learning is crucial for training deep networks on very complicated problems. We followed the curriculum learning schedule from Zaremba & Sutskever (2014) without any modifications.

For each of the tasks we have manually defined a sequence of subtasks with increasing difficulty, where the difficulty is usually measured by the length of the input sequence. During training the input-output examples are sampled from a distribution that is determined by the current difficulty level D. The level is increased (up to some maximal value) whenever the error rate of the model goes below some threshold. Moreover, we ensure that successive increases of D are separated by some number of batches.

In more detail, to generate an input-output example we first sample a difficulty d from a distribution determined by the current level D and then draw the example with the difficulty d. The procedure for sampling d is the following:

- with probability 10%: pick d uniformly at random from the set of all possible difficulties;
- with probability 25%: pick d uniformly from [1, D + e], where e is a sample from a geometric distribution with a success probability 1/2;
- with probability 65%: set d = D + e, where e is sampled as above.

Notice that the above procedure guarantees that every difficulty d can be picked regardless of the current level D, which has been shown to increase performance (Zaremba & Sutskever, 2014).
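The sampling procedure above can be sketched as follows. Several details are assumptions made only for illustration, since the text does not specify them: the geometric sample e counts failures before the first success (so e can be 0), D + e is clipped to the maximal difficulty, difficulties are the integers 1, ..., max_difficulty, and the names `geometric` and `sample_difficulty` are hypothetical.

```python
import random


def geometric(rng, p=0.5):
    """Number of failures before the first success (assumed support 0, 1, 2, ...)."""
    e = 0
    while rng.random() >= p:
        e += 1
    return e


def sample_difficulty(level, max_difficulty, rng):
    """Sample a difficulty d given the current curriculum level D = level."""
    u = rng.random()
    if u < 0.10:
        # 10%: uniform over all possible difficulties
        return rng.randint(1, max_difficulty)
    # Clipping to max_difficulty is an assumption, not stated in the text.
    upper = min(level + geometric(rng), max_difficulty)
    if u < 0.35:
        # 25%: uniform over [1, D + e]
        return rng.randint(1, upper)
    # 65%: d = D + e
    return upper


rng = random.Random(0)
samples = [sample_difficulty(level=3, max_difficulty=10, rng=rng) for _ in range(1000)]
print(min(samples), max(samples))
```

The first branch is what makes every difficulty reachable at every level; the other two concentrate most samples at or slightly above the current level.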

C EXAMPLE CIRCUITS

Below we present example circuits generated during training for all simple tasks (except Copy, which was presented in the paper). For modules READ and WRITE, the value of the first argument (pointer to the address to be read/written) is marked as p. For WRITE, the value to be written is marked as a and the value returned by this module is always 0. For modules LESS-THAN and LESS-OR-EQUAL-THAN the first parameter is marked as x and the second one as y. Other modules either have only one parameter or the order of parameters is not important.

For all tasks below (except Increment), the circuit generated at timestep 1 is different from the circuits generated at steps ≥ 2, which are all the same. This is because the first circuit needs to handle the initialization. We present only the main circuits generated for timesteps ≥ 2.

C.1 ACCESS

Figure 5: The circuit generated at every timestep ≥ 2 for the task Access.

Table 3: Memory for task Access. Only the first memory cell is modified.

C.2 INCREMENT

Figure 6: The circuit generated at every timestep for the task Increment.

Table 4: Memory for task Increment.

C.3 REVERSE

Figure 7: The circuit generated at every timestep ≥ 2 for the task Reverse.

Table 5: Memory for task Reverse.

C.4 SWAP

For Swap we observed that 2 different circuits are generated, one for even timesteps and one for odd timesteps.

Figure 8: The circuit generated at every even timestep for the task Swap.

Figure 9: The circuit generated at every odd timestep ≥ 3 for the task Swap.

Table 6: Memory for task Swap.

4. Score normalization technical details We now discuss the technical details of the score normalization method.

4. Score normalization technical details We now discuss the technical details of the score normalization method. SMT SCORING SYSTEM This document describes the scoring system for the Stanford Math Tournament We begin by giving an overview of the changes to scoring and a non-technical descrition of the scoring rules

More information

Computer arithmetic. Intensive Computation. Annalisa Massini 2017/2018

Computer arithmetic. Intensive Computation. Annalisa Massini 2017/2018 Comuter arithmetic Intensive Comutation Annalisa Massini 7/8 Intensive Comutation - 7/8 References Comuter Architecture - A Quantitative Aroach Hennessy Patterson Aendix J Intensive Comutation - 7/8 3

More information

Feedback-error control

Feedback-error control Chater 4 Feedback-error control 4.1 Introduction This chater exlains the feedback-error (FBE) control scheme originally described by Kawato [, 87, 8]. FBE is a widely used neural network based controller

More information

Convex Optimization methods for Computing Channel Capacity

Convex Optimization methods for Computing Channel Capacity Convex Otimization methods for Comuting Channel Caacity Abhishek Sinha Laboratory for Information and Decision Systems (LIDS), MIT sinhaa@mit.edu May 15, 2014 We consider a classical comutational roblem

More information

Solved Problems. (a) (b) (c) Figure P4.1 Simple Classification Problems First we draw a line between each set of dark and light data points.

Solved Problems. (a) (b) (c) Figure P4.1 Simple Classification Problems First we draw a line between each set of dark and light data points. Solved Problems Solved Problems P Solve the three simle classification roblems shown in Figure P by drawing a decision boundary Find weight and bias values that result in single-neuron ercetrons with the

More information

On split sample and randomized confidence intervals for binomial proportions

On split sample and randomized confidence intervals for binomial proportions On slit samle and randomized confidence intervals for binomial roortions Måns Thulin Deartment of Mathematics, Usala University arxiv:1402.6536v1 [stat.me] 26 Feb 2014 Abstract Slit samle methods have

More information

Universal Finite Memory Coding of Binary Sequences

Universal Finite Memory Coding of Binary Sequences Deartment of Electrical Engineering Systems Universal Finite Memory Coding of Binary Sequences Thesis submitted towards the degree of Master of Science in Electrical and Electronic Engineering in Tel-Aviv

More information

State Estimation with ARMarkov Models

State Estimation with ARMarkov Models Deartment of Mechanical and Aerosace Engineering Technical Reort No. 3046, October 1998. Princeton University, Princeton, NJ. State Estimation with ARMarkov Models Ryoung K. Lim 1 Columbia University,

More information

Recent Developments in Multilayer Perceptron Neural Networks

Recent Developments in Multilayer Perceptron Neural Networks Recent Develoments in Multilayer Percetron eural etworks Walter H. Delashmit Lockheed Martin Missiles and Fire Control Dallas, Texas 75265 walter.delashmit@lmco.com walter.delashmit@verizon.net Michael

More information

MATH 2710: NOTES FOR ANALYSIS

MATH 2710: NOTES FOR ANALYSIS MATH 270: NOTES FOR ANALYSIS The main ideas we will learn from analysis center around the idea of a limit. Limits occurs in several settings. We will start with finite limits of sequences, then cover infinite

More information

Approximating min-max k-clustering

Approximating min-max k-clustering Aroximating min-max k-clustering Asaf Levin July 24, 2007 Abstract We consider the roblems of set artitioning into k clusters with minimum total cost and minimum of the maximum cost of a cluster. The cost

More information

AI*IA 2003 Fusion of Multiple Pattern Classifiers PART III

AI*IA 2003 Fusion of Multiple Pattern Classifiers PART III AI*IA 23 Fusion of Multile Pattern Classifiers PART III AI*IA 23 Tutorial on Fusion of Multile Pattern Classifiers by F. Roli 49 Methods for fusing multile classifiers Methods for fusing multile classifiers

More information

ECE 534 Information Theory - Midterm 2

ECE 534 Information Theory - Midterm 2 ECE 534 Information Theory - Midterm Nov.4, 009. 3:30-4:45 in LH03. You will be given the full class time: 75 minutes. Use it wisely! Many of the roblems have short answers; try to find shortcuts. You

More information

Radial Basis Function Networks: Algorithms

Radial Basis Function Networks: Algorithms Radial Basis Function Networks: Algorithms Introduction to Neural Networks : Lecture 13 John A. Bullinaria, 2004 1. The RBF Maing 2. The RBF Network Architecture 3. Comutational Power of RBF Networks 4.

More information

arxiv: v1 [cs.lg] 4 May 2015

arxiv: v1 [cs.lg] 4 May 2015 Reinforcement Learning Neural Turing Machines arxiv:1505.00521v1 [cs.lg] 4 May 2015 Wojciech Zaremba 1,2 NYU woj.zaremba@gmail.com Abstract Ilya Sutskever 2 Google ilyasu@google.com The expressive power

More information

LIMITATIONS OF RECEPTRON. XOR Problem The failure of the perceptron to successfully simple problem such as XOR (Minsky and Papert).

LIMITATIONS OF RECEPTRON. XOR Problem The failure of the perceptron to successfully simple problem such as XOR (Minsky and Papert). LIMITATIONS OF RECEPTRON XOR Problem The failure of the ercetron to successfully simle roblem such as XOR (Minsky and Paert). x y z x y z 0 0 0 0 0 0 Fig. 4. The exclusive-or logic symbol and function

More information

Distributed Rule-Based Inference in the Presence of Redundant Information

Distributed Rule-Based Inference in the Presence of Redundant Information istribution Statement : roved for ublic release; distribution is unlimited. istributed Rule-ased Inference in the Presence of Redundant Information June 8, 004 William J. Farrell III Lockheed Martin dvanced

More information

John Weatherwax. Analysis of Parallel Depth First Search Algorithms

John Weatherwax. Analysis of Parallel Depth First Search Algorithms Sulementary Discussions and Solutions to Selected Problems in: Introduction to Parallel Comuting by Viin Kumar, Ananth Grama, Anshul Guta, & George Karyis John Weatherwax Chater 8 Analysis of Parallel

More information

arxiv: v1 [cs.cl] 21 May 2017

arxiv: v1 [cs.cl] 21 May 2017 Spelling Correction as a Foreign Language Yingbo Zhou yingbzhou@ebay.com Utkarsh Porwal uporwal@ebay.com Roberto Konow rkonow@ebay.com arxiv:1705.07371v1 [cs.cl] 21 May 2017 Abstract In this paper, we

More information

Cryptanalysis of Pseudorandom Generators

Cryptanalysis of Pseudorandom Generators CSE 206A: Lattice Algorithms and Alications Fall 2017 Crytanalysis of Pseudorandom Generators Instructor: Daniele Micciancio UCSD CSE As a motivating alication for the study of lattice in crytograhy we

More information

arxiv: v3 [cs.lg] 12 Jan 2016

arxiv: v3 [cs.lg] 12 Jan 2016 REINFORCEMENT LEARNING NEURAL TURING MACHINES - REVISED Wojciech Zaremba 1,2 New York University Facebook AI Research woj.zaremba@gmail.com Ilya Sutskever 2 Google Brain ilyasu@google.com ABSTRACT arxiv:1505.00521v3

More information

On the Chvatál-Complexity of Knapsack Problems

On the Chvatál-Complexity of Knapsack Problems R u t c o r Research R e o r t On the Chvatál-Comlexity of Knasack Problems Gergely Kovács a Béla Vizvári b RRR 5-08, October 008 RUTCOR Rutgers Center for Oerations Research Rutgers University 640 Bartholomew

More information

For q 0; 1; : : : ; `? 1, we have m 0; 1; : : : ; q? 1. The set fh j(x) : j 0; 1; ; : : : ; `? 1g forms a basis for the tness functions dened on the i

For q 0; 1; : : : ; `? 1, we have m 0; 1; : : : ; q? 1. The set fh j(x) : j 0; 1; ; : : : ; `? 1g forms a basis for the tness functions dened on the i Comuting with Haar Functions Sami Khuri Deartment of Mathematics and Comuter Science San Jose State University One Washington Square San Jose, CA 9519-0103, USA khuri@juiter.sjsu.edu Fax: (40)94-500 Keywords:

More information

A Social Welfare Optimal Sequential Allocation Procedure

A Social Welfare Optimal Sequential Allocation Procedure A Social Welfare Otimal Sequential Allocation Procedure Thomas Kalinowsi Universität Rostoc, Germany Nina Narodytsa and Toby Walsh NICTA and UNSW, Australia May 2, 201 Abstract We consider a simle sequential

More information

Statics and dynamics: some elementary concepts

Statics and dynamics: some elementary concepts 1 Statics and dynamics: some elementary concets Dynamics is the study of the movement through time of variables such as heartbeat, temerature, secies oulation, voltage, roduction, emloyment, rices and

More information

DETC2003/DAC AN EFFICIENT ALGORITHM FOR CONSTRUCTING OPTIMAL DESIGN OF COMPUTER EXPERIMENTS

DETC2003/DAC AN EFFICIENT ALGORITHM FOR CONSTRUCTING OPTIMAL DESIGN OF COMPUTER EXPERIMENTS Proceedings of DETC 03 ASME 003 Design Engineering Technical Conferences and Comuters and Information in Engineering Conference Chicago, Illinois USA, Setember -6, 003 DETC003/DAC-48760 AN EFFICIENT ALGORITHM

More information

Brownian Motion and Random Prime Factorization

Brownian Motion and Random Prime Factorization Brownian Motion and Random Prime Factorization Kendrick Tang June 4, 202 Contents Introduction 2 2 Brownian Motion 2 2. Develoing Brownian Motion.................... 2 2.. Measure Saces and Borel Sigma-Algebras.........

More information

Named Entity Recognition using Maximum Entropy Model SEEM5680

Named Entity Recognition using Maximum Entropy Model SEEM5680 Named Entity Recognition using Maximum Entroy Model SEEM5680 Named Entity Recognition System Named Entity Recognition (NER): Identifying certain hrases/word sequences in a free text. Generally it involves

More information

arxiv: v1 [physics.data-an] 26 Oct 2012

arxiv: v1 [physics.data-an] 26 Oct 2012 Constraints on Yield Parameters in Extended Maximum Likelihood Fits Till Moritz Karbach a, Maximilian Schlu b a TU Dortmund, Germany, moritz.karbach@cern.ch b TU Dortmund, Germany, maximilian.schlu@cern.ch

More information

Uncorrelated Multilinear Principal Component Analysis for Unsupervised Multilinear Subspace Learning

Uncorrelated Multilinear Principal Component Analysis for Unsupervised Multilinear Subspace Learning TNN-2009-P-1186.R2 1 Uncorrelated Multilinear Princial Comonent Analysis for Unsuervised Multilinear Subsace Learning Haiing Lu, K. N. Plataniotis and A. N. Venetsanooulos The Edward S. Rogers Sr. Deartment

More information

8 STOCHASTIC PROCESSES

8 STOCHASTIC PROCESSES 8 STOCHASTIC PROCESSES The word stochastic is derived from the Greek στoχαστικoς, meaning to aim at a target. Stochastic rocesses involve state which changes in a random way. A Markov rocess is a articular

More information

Information collection on a graph

Information collection on a graph Information collection on a grah Ilya O. Ryzhov Warren Powell February 10, 2010 Abstract We derive a knowledge gradient olicy for an otimal learning roblem on a grah, in which we use sequential measurements

More information

An Analysis of Reliable Classifiers through ROC Isometrics

An Analysis of Reliable Classifiers through ROC Isometrics An Analysis of Reliable Classifiers through ROC Isometrics Stijn Vanderlooy s.vanderlooy@cs.unimaas.nl Ida G. Srinkhuizen-Kuyer kuyer@cs.unimaas.nl Evgueni N. Smirnov smirnov@cs.unimaas.nl MICC-IKAT, Universiteit

More information

Multi-Operation Multi-Machine Scheduling

Multi-Operation Multi-Machine Scheduling Multi-Oeration Multi-Machine Scheduling Weizhen Mao he College of William and Mary, Williamsburg VA 3185, USA Abstract. In the multi-oeration scheduling that arises in industrial engineering, each job

More information

MODELING THE RELIABILITY OF C4ISR SYSTEMS HARDWARE/SOFTWARE COMPONENTS USING AN IMPROVED MARKOV MODEL

MODELING THE RELIABILITY OF C4ISR SYSTEMS HARDWARE/SOFTWARE COMPONENTS USING AN IMPROVED MARKOV MODEL Technical Sciences and Alied Mathematics MODELING THE RELIABILITY OF CISR SYSTEMS HARDWARE/SOFTWARE COMPONENTS USING AN IMPROVED MARKOV MODEL Cezar VASILESCU Regional Deartment of Defense Resources Management

More information

A Bound on the Error of Cross Validation Using the Approximation and Estimation Rates, with Consequences for the Training-Test Split

A Bound on the Error of Cross Validation Using the Approximation and Estimation Rates, with Consequences for the Training-Test Split A Bound on the Error of Cross Validation Using the Aroximation and Estimation Rates, with Consequences for the Training-Test Slit Michael Kearns AT&T Bell Laboratories Murray Hill, NJ 7974 mkearns@research.att.com

More information

CSC165H, Mathematical expression and reasoning for computer science week 12

CSC165H, Mathematical expression and reasoning for computer science week 12 CSC165H, Mathematical exression and reasoning for comuter science week 1 nd December 005 Gary Baumgartner and Danny Hea hea@cs.toronto.edu SF4306A 416-978-5899 htt//www.cs.toronto.edu/~hea/165/s005/index.shtml

More information

Evaluating Circuit Reliability Under Probabilistic Gate-Level Fault Models

Evaluating Circuit Reliability Under Probabilistic Gate-Level Fault Models Evaluating Circuit Reliability Under Probabilistic Gate-Level Fault Models Ketan N. Patel, Igor L. Markov and John P. Hayes University of Michigan, Ann Arbor 48109-2122 {knatel,imarkov,jhayes}@eecs.umich.edu

More information

Unit 1 - Computer Arithmetic

Unit 1 - Computer Arithmetic FIXD-POINT (FX) ARITHMTIC Unit 1 - Comuter Arithmetic INTGR NUMBRS n bit number: b n 1 b n 2 b 0 Decimal Value Range of values UNSIGND n 1 SIGND D = b i 2 i D = 2 n 1 b n 1 + b i 2 i n 2 i=0 i=0 [0, 2

More information

Topic: Lower Bounds on Randomized Algorithms Date: September 22, 2004 Scribe: Srinath Sridhar

Topic: Lower Bounds on Randomized Algorithms Date: September 22, 2004 Scribe: Srinath Sridhar 15-859(M): Randomized Algorithms Lecturer: Anuam Guta Toic: Lower Bounds on Randomized Algorithms Date: Setember 22, 2004 Scribe: Srinath Sridhar 4.1 Introduction In this lecture, we will first consider

More information

A New Perspective on Learning Linear Separators with Large L q L p Margins

A New Perspective on Learning Linear Separators with Large L q L p Margins A New Persective on Learning Linear Searators with Large L q L Margins Maria-Florina Balcan Georgia Institute of Technology Christoher Berlind Georgia Institute of Technology Abstract We give theoretical

More information

A Qualitative Event-based Approach to Multiple Fault Diagnosis in Continuous Systems using Structural Model Decomposition

A Qualitative Event-based Approach to Multiple Fault Diagnosis in Continuous Systems using Structural Model Decomposition A Qualitative Event-based Aroach to Multile Fault Diagnosis in Continuous Systems using Structural Model Decomosition Matthew J. Daigle a,,, Anibal Bregon b,, Xenofon Koutsoukos c, Gautam Biswas c, Belarmino

More information

Combining Logistic Regression with Kriging for Mapping the Risk of Occurrence of Unexploded Ordnance (UXO)

Combining Logistic Regression with Kriging for Mapping the Risk of Occurrence of Unexploded Ordnance (UXO) Combining Logistic Regression with Kriging for Maing the Risk of Occurrence of Unexloded Ordnance (UXO) H. Saito (), P. Goovaerts (), S. A. McKenna (2) Environmental and Water Resources Engineering, Deartment

More information

GOOD MODELS FOR CUBIC SURFACES. 1. Introduction

GOOD MODELS FOR CUBIC SURFACES. 1. Introduction GOOD MODELS FOR CUBIC SURFACES ANDREAS-STEPHAN ELSENHANS Abstract. This article describes an algorithm for finding a model of a hyersurface with small coefficients. It is shown that the aroach works in

More information

On Line Parameter Estimation of Electric Systems using the Bacterial Foraging Algorithm

On Line Parameter Estimation of Electric Systems using the Bacterial Foraging Algorithm On Line Parameter Estimation of Electric Systems using the Bacterial Foraging Algorithm Gabriel Noriega, José Restreo, Víctor Guzmán, Maribel Giménez and José Aller Universidad Simón Bolívar Valle de Sartenejas,

More information

CMSC 425: Lecture 4 Geometry and Geometric Programming

CMSC 425: Lecture 4 Geometry and Geometric Programming CMSC 425: Lecture 4 Geometry and Geometric Programming Geometry for Game Programming and Grahics: For the next few lectures, we will discuss some of the basic elements of geometry. There are many areas

More information

The non-stochastic multi-armed bandit problem

The non-stochastic multi-armed bandit problem Submitted for journal ublication. The non-stochastic multi-armed bandit roblem Peter Auer Institute for Theoretical Comuter Science Graz University of Technology A-8010 Graz (Austria) auer@igi.tu-graz.ac.at

More information

Principal Components Analysis and Unsupervised Hebbian Learning

Principal Components Analysis and Unsupervised Hebbian Learning Princial Comonents Analysis and Unsuervised Hebbian Learning Robert Jacobs Deartment of Brain & Cognitive Sciences University of Rochester Rochester, NY 1467, USA August 8, 008 Reference: Much of the material

More information

Machine Learning: Homework 4

Machine Learning: Homework 4 10-601 Machine Learning: Homework 4 Due 5.m. Monday, February 16, 2015 Instructions Late homework olicy: Homework is worth full credit if submitted before the due date, half credit during the next 48 hours,

More information

Covariance Matrix Estimation for Reinforcement Learning

Covariance Matrix Estimation for Reinforcement Learning Covariance Matrix Estimation for Reinforcement Learning Tomer Lancewicki Deartment of Electrical Engineering and Comuter Science University of Tennessee Knoxville, TN 37996 tlancewi@utk.edu Itamar Arel

More information

Notes on Instrumental Variables Methods

Notes on Instrumental Variables Methods Notes on Instrumental Variables Methods Michele Pellizzari IGIER-Bocconi, IZA and frdb 1 The Instrumental Variable Estimator Instrumental variable estimation is the classical solution to the roblem of

More information

PER-PATCH METRIC LEARNING FOR ROBUST IMAGE MATCHING. Sezer Karaoglu, Ivo Everts, Jan C. van Gemert, and Theo Gevers

PER-PATCH METRIC LEARNING FOR ROBUST IMAGE MATCHING. Sezer Karaoglu, Ivo Everts, Jan C. van Gemert, and Theo Gevers PER-PATCH METRIC LEARNING FOR ROBUST IMAGE MATCHING Sezer Karaoglu, Ivo Everts, Jan C. van Gemert, and Theo Gevers Intelligent Systems Lab, Amsterdam, University of Amsterdam, 1098 XH Amsterdam, The Netherlands

More information

Fuzzy Automata Induction using Construction Method

Fuzzy Automata Induction using Construction Method Journal of Mathematics and Statistics 2 (2): 395-4, 26 ISSN 1549-3644 26 Science Publications Fuzzy Automata Induction using Construction Method 1,2 Mo Zhi Wen and 2 Wan Min 1 Center of Intelligent Control

More information

Information collection on a graph

Information collection on a graph Information collection on a grah Ilya O. Ryzhov Warren Powell October 25, 2009 Abstract We derive a knowledge gradient olicy for an otimal learning roblem on a grah, in which we use sequential measurements

More information

Chapter 7 Rational and Irrational Numbers

Chapter 7 Rational and Irrational Numbers Chater 7 Rational and Irrational Numbers In this chater we first review the real line model for numbers, as discussed in Chater 2 of seventh grade, by recalling how the integers and then the rational numbers

More information

Metrics Performance Evaluation: Application to Face Recognition

Metrics Performance Evaluation: Application to Face Recognition Metrics Performance Evaluation: Alication to Face Recognition Naser Zaeri, Abeer AlSadeq, and Abdallah Cherri Electrical Engineering Det., Kuwait University, P.O. Box 5969, Safat 6, Kuwait {zaery, abeer,

More information

RANDOM WALKS AND PERCOLATION: AN ANALYSIS OF CURRENT RESEARCH ON MODELING NATURAL PROCESSES

RANDOM WALKS AND PERCOLATION: AN ANALYSIS OF CURRENT RESEARCH ON MODELING NATURAL PROCESSES RANDOM WALKS AND PERCOLATION: AN ANALYSIS OF CURRENT RESEARCH ON MODELING NATURAL PROCESSES AARON ZWIEBACH Abstract. In this aer we will analyze research that has been recently done in the field of discrete

More information

Multilayer Perceptron Neural Network (MLPs) For Analyzing the Properties of Jordan Oil Shale
