arxiv: v1 [cs.ai] 19 May 2017

Size: px
Start display at page:

Download "arxiv: v1 [cs.ai] 19 May 2017"

Transcription

1 Model-Based Planning in Discrete Action Spaces Mikael Henaff Courant Institute, New York University William F. Whitney Courant Institute, New York University arxiv: v1 [cs.ai] 19 May 2017 Yann LeCun Courant Institute, New York University Facebook AI Research Abstract Planning actions using learned and differentiable forward models of the world is a general approach which has a number of desirable properties, including improved sample complexity over model-free RL methods, reuse of learned models across different tasks, and the ability to perform efficient gradient-based optimization in continuous action spaces. However, this approach does not apply straightforwardly when the action space is discrete, which may have limited its adoption. In this work, we introduce two discrete planning tasks inspired by existing question-answering datasets and show that it is in fact possible to effectively perform planning via backprop in discrete action spaces using two simple yet principled modifications. Our experiments show that this approach can significantly outperform model-free RL based methods and supervised imitation learners. 1 Introduction Planning actions in order to accomplish a specific goal is a challenge of fundamental interest in artificial intelligence. One of the main paradigms for addressing planning problems is model-free reinforcement learning (RL), a general approach where an agent samples actions according to an internal policy and then adjusts the policy as a function of the reward it receives for different actions. This method has the advantage of making minimal assumptions about the task at hand and can learn complex policies using only the raw state representation and a scalar reward signal. In recent years model-free RL using deep learning has proven successful for a number of applications including Atari games [26], robotic manipulation [11], navigation and reasoning tasks [28], and machine translation [16]. However, it also suffers from several limitations, including a difficult temporal credit assignment problem, high sample complexity [31], and non-stationarity of the input distribution. An alternative approach is imitation learning [29] where the agent is trained to predict actions of an expert given a state; this can be cast as a supervised learning problem. However, small inaccuracies can accumulate over time, leading to situations where the agent finds itself in situations not encountered during training [4]. Furthermore, since the learner does not directly optimize a reward function and instead relies on the expert s policy to be close to optimal, in the absence of additional queries to the environment its performance will be bounded by that of the expert [2]. Model-based planning assumes the existence of a forward model of the environment which can predict how the world will evolve in response to different actions. Actions can then be planned by using this forward model to select a sequence of actions which will take the agent from its current state to a desired goal state. It is particularly appealing to consider differentiable forward models, as they provide gradients which define a direction of improvement for a sequence of actions. These gradients with respect to plans make it possible to do "planning by backprop", directly optimizing

2 a sequence of actions by backpropagating gradients from a goal state through a learned model to update a plan. Planning with a learned forward model has several desirable properties: training a forward model has lower sample complexity than model-free RL due to the rich information content of its highdimensional error signal; it decouples the learned model from the rewards of the specific task it was trained on, allowing the same model to be used for multiple planning tasks; and it can be applied in settings without a known structure. Evidence from neuroscience also suggests that humans use forward models for certain decision-making tasks [10]. Applying gradient-based optimization is straightforward when the action space is continuous, since the backpropagation rule allows easy computation of the gradients with respect to the input actions. However, many planning tasks require an agent to select actions from a discrete set, and performing gradient-based optimization in this discrete action space is not straightforward, since it may yield a solution which is not a valid action. In this work, we show that by using a simple paramaterization of actions on the simplex combined with input noise during planning, we are able to effectively perform gradient-based planning in discrete action spaces. We provide theoretical and empirical evidence that adding random noise to the inputs naturally leads to solutions close to one-hot action vectors. We introduce two novel discrete planning tasks inspired by existing question-answering benchmarks, and compare our model-based planning method, which we call the Forward Planner, to a model-free RL baseline as well as a supervised imitation learner. Planning by backprop consistently outperforms the alternative methods (at the cost of additional planning time). Although the discrete planning tasks introduced here are simple to describe, none of the methods are able to solve them completely, suggesting they may be used as a testbed for further model development. We hope these results encourage renewed interest in model-based planning in the machine learning community. 2 Planning with Forward Models Planning with a forward model consists of two stages. In the first, the agent constructs a forward model of the world by observing how the environment responds to actions. More precisely, the agent is presented with triplets (s 0, a, s ) from the environment E each consisting of an initial state s 0, an action a, and a final state s. Here s 0, a, s can each represent either single instances or sequences of actions or states. The agent learns a forward model f(s 0, a, θ) with trainable parameters θ to predict a state s which minimizes a loss function L(s, s ) measuring the discrepancy between the predicted state and the observed state: arg min θ E (s0,a,s ) E [ ] L(f(s 0, a, θ), s ) After training, given some initial state s 0 the agent can use the forward model to plan actions with the goal of achieving a desired final state s. This is done by solving the following optimization problem: arg min a L(f(s 0, a, θ), s ) If f is a neural network model and the input action space is continuous, this optimization procedure can be done by straightforwardly applying the backpropagation rule which computes the gradients with respect to both the weights and the inputs. Starting from a randomly initialized action vector, the loss function can be minimized by performing gradient descent in the input action space. When the input space is discrete, input tokens are typically encoded as one-hot vectors, and there is no guarantee that the result of the above optimization procedure will constitute a valid input action sequence where each inferred input element is placed at a one-hot vector. We propose to address this problem with two simple steps. Denote the action space by A = {e 1,..., e d }, where each action is represented by a one-hot vector. The first step is to restrict the input space for each possible action to the simplex d, which includes the one-hot vectors, rather than all of R d. Note that the points on the simplex can be written as n = {z : z i 0, i z i = 1} = {z : 2

3 Algorithm 1 Forward Planner Require: Trained forward model f, initial state s 0, desired final state s, learning rate η. 1: Initialize x t R A from N (0, 0.1) for t = 1,..., T. 2: for i = 1 : k do 3: x t x t + ɛ for t = 1,..., T. Add noise to inputs 4: a t σ(x t ) for t = 1,..., T. Compute action vectors 5: s f(s 0, a 1,..., a T, θ) Predict final state for this action sequence 6: Compute L(s, s ) and s Forward and backprop through L 7: Compute a t for t = 1,..., T Backprop through f using s 8: Compute x t for t = 1,..., T Backprop through σ using a t 9: x t ADAM(x t, x t, η) for t = 1,..., T. Update using ADAM 10: end for 11: a t σ(x t ) for t = 1,..., T. 12: a t arg min ei {e 1,...,e d } a t e i for t = 1,..., T. Quantize actions to one-hot vectors 13: return a 1,..., a T z = σ(x), x R d }, where σ represents the softmax function. A relaxed optimization problem can thus be reformulated as: arg min x=(x 1,...,x T ) L(f(s 0, σ(x), θ), s ) (1) where σ(x) corresponds to softmaxes applied to each x t separately. A sequence of tokens can be chosen by minimizing the above loss function, quantizing each σ(x t ) to the closest one-hot vector, and mapping back to the corresponding token. The second step consists of adding random noise ɛ to the inputs x during optimization, which we show has the effect of biasing solutions towards the one-hot vectors. The work of [3] shows that adding independent noise to the weights of a neural network during training induces an additional term on the loss function. Applying the same derivation, we can show that adding noise to the inputs x during minimization corresponds to adding the following penalty: P = i 2 f(s 0, σ(x), θ) 2 x i L s ɛ i + i 2 f x i ɛ i (2) Here ɛ i denotes the variance of the noise along each component. The second-order part of the first term represents the trace of the Hessian of the loss with respect to the inputs, which is equal to the sum of its eigenvalues. This can be viewed as a rough approximation to the curvature at the current point. The first-order part L s changes the sign of the term from negative to positive as the current action vector moves from regions of high loss to regions of low loss. This term thus penalizes regions of the input space which have low loss but high curvature, e.g. very sharp local minima. The second term penalizes input configurations which lead the function to have high sensitivity to input variations. This biases the inputs towards regions where small perturbations result in near-zero changes to the output, such as saturated regions of the softmax. In particular, this term is minimized when a t = σ(x t ) is a unit vector. All together this regularizer encourages the optimization to find broad, flat regions of the space instead of narrow optima. An alternative to adding noise to the inputs x t would be to add noise to the gradients x t. In the case of a fixed learning rate, this is essentially equivalent since adding noise to the gradient the same as adding noise to the updated x t scaled by the learning rate. In the case of adaptive learning rates such as that used in ADAM [20], this corresponds to scaling the noise by the running average of the gradient magnitude, which may be desirable. Finally, an explicit way of encouraging solutions towards the one-hot vectors would be to directly penalize the entropy of the actions, i.e. add an extra term H(σ(x t )) to the loss function, where H represents the entropy operator. In our experiments, we investigate these three approaches. 3

4 Table 1: Modification of QA tasks to become planning tasks. Asterisks indicate elements which the method must fill in. (A) NAVIGATION TASK QA TASK PLANNING TASK agent1 is at (8,4) agent1 is at (8,4) agent1 faces-e agent1 faces-e agent1 moves-2 * agent1 moves-5 * agent1 faces-s * agent1 moves-5 * agent1 faces-s * agent1 moves-1 * agent1 moves-1 * agent1 moves-1 * Q1: where is agent1? Q1: where is agent1? A1: * A1: (10,1) (B) TRANSPORT TASK QA TASK PLANNING TASK object1 is at location3 object1 is at location3 object2 is at location4 object2 is at location4 object3 is at location2 object3 is at location2 Jason went to location3 Jason went to location3 Jason picked up the objects * Jason went to location2 * Jason picked up the objects * Jason went to location1 * Q1: Where is object1? Q1: Where is object1? Q2: Where is object2? Q2: Where is object2? Q3: Where is object3? Q3: Where is object3? A1: * A1: location1 A2: * A2: location4 A3: * A3: location1 3 Tasks and Architectures In this section we introduce two discrete planning tasks inspired by question-answering datasets and the various methods we apply to solve them. Textual question-answering tasks are a common benchmark in machine learning [34, 15, 14]. In these tasks, the model is given triplets consisting of a story (or context), a question, and an answer, and it is trained to predict the correct answer given the question and context. Several of the babi tasks introduced in [34] consist of stories describing the actions of one or more agents acting in the world, and the question requires the model to produce an answer about some aspect of the world which depends on the agent s actions. The question together with its answer can jointly be seen as representing some aspect of the final state of the story, which we denote by s, and the story can be seen as consisting of a sequence of actions a. Tasks of this nature can be straightforwardly adapted to become planning tasks. Instead of requiring the model to produce s given a, we require it to produce a given s, together with an initial state of the story. In other words, we give the model a first part of the story describing the initial state of the world, fix a set of questions and their answers, and require the model to produce a sequence of actions a such that the answers to the questions must be the ones specified. 3.1 Navigation Task The first task we introduce is a navigation task, and is inspired by the World Model task in [13]. In this scenario, an agent moves around on a 2-D grid, and the model must learn to plan sequences of actions to take the agent from one point to another. The agent can perform one of 10 actions: face in a given direction (north, south, east or west), move ahead between one and five steps, or do nothing. We generated sequences as follows. The agent was first placed at a random location on a grid, and then a number of actions was chosen uniformly between 1 and a maximum length T max. For each action, the agent either chooses to change direction or to move ahead with equal probability. If it changes direction, it randomly faces north, south, east or west, again with equal probability. If it moves, it makes a number of steps uniformly chosen between 1 and 5. If this would ever take the agent off the grid, the agent simply stops at the edge of the grid. There are thus no undefined sequences of actions, but some actions may have no effect (for example, taking steps if the agent is facing the edge of the grid). We define the path distance to be the distance between the start and end points of the agent, measured in l 1 norm. Note that this will typically be different than the number of steps taken, since the agent may change direction or double back. We constructed the training set so that the random walks all had path distance at most 5, by resampling sequences until they had path distance at most 5. This allowed us to test models on sequences with path distance greater than 5 and see whether they could generalize to plan paths which are longer than those seen during training. We generated two datasets with maximum number of steps T max {5, 10}, each with two million samples. 4

5 3.2 Transport Task The second task we introduce requires an agent to travel to different locations to pick up objects. There are 4 objects and 8 locations. In the initial state, each object is placed at a distinct location. At every subsequent timestep, the agent can either travel to a location or attempt to pick up the object at its current location. If the agent attemps to pick up an object when there is none, this action has no effect. At the end of the sequence, the model must provide the current location of each of the objects. The model must thus learn to keep a list of objects being carried in memory, track the agent s location and update the list as a function of the agent s actions. This task is related to babi tasks 2 and 8 ( Lists and Sets, Two supporting Facts ) and requires up to four supporting facts: the original location of an object, whether the agent has moved to that location, whether the agent has picked up the object while at that location, and the current location of the agent. We generated sequences as follows. First, distinct locations for each of the objects are given, and then a number of actions is chosen between 1 and a maximum length T max. For each action, the agent follows the following policy: if the location has an object which is not being carried by the agent, the agent picks it up with probability 0.5; if not, the agent still attempts to pick up an object with probability 0.25 (this action has no effect). If the agent does not attempt to pick up an object, it moves to a random location. We generated three versions of this dataset with different maximum lengths T max = 10, 15, 20, each with 2 million samples. The shorter the maximum sequence length, the more the dataset is biased towards action sequences which do not move any objects (see the Appendix for distributions of objects moved for each sequence length). 3.3 Architectures We now describe the different methods we tested to address these planning tasks; more details are included in the Appendix. As a forward model, we used an RNN with 100 hidden units and parametric ReLU non-linearities combined with a spherical normalization layer to prevent divergence of the activations. The forward model was trained to minimize the cross-entropy between its output at the final timestep and the correct final state. We then used this forward model to perform planning by backprop using the method described in Section 2, which we call the Forward Planner. Our first baseline was a Q-learning agent in the discrete space, trained via a method similar to [26] including target networks and replay memory. To encode stories into state vectors, we used the same recurrent architecture as the forward model where we returned the hidden state at the final step of each sequence. We used the same distribution of initial and final states to train the forward model and the Q-learner, i.e. for some triplet (s 0, a, s ) in the training set, we used s 0, s as initial and goal states for the Q-learner. The Q-learner was free to interact with the environment according to its action policy. At the end of each action sequence, the agent received a reward of 1 from the environment if its current state matched the desired goal state, and a reward of 0 otherwise. Note that this setting, with no intermediate rewards and no intermediate states other than the actions taken, is quite difficult for model-free RL methods; they typically rely on full state observations at each timestep to factorize the problem. As a second baseline, we trained a purely supervised Imitation Learner to predict each action conditioned on previous actions and the desired final state, using the Contextual RNN architecture introduced in [25]. The model is first given the initial state s 0 and then is required to predict the sequence of actions a 1,..., a T conditioned on the desired final state s (which is in this case the "context" per Mikolov). At each timestep, the hidden state is updated using the current action and the previous hidden state, and is then combined with the desired final state to predict the next action. At test time, at each timestep we sample an action from the output distribution over actions and use it as the input action at the next timestep, similarly to how recurrent language models are used to generate text. 4 Related Work The idea of planning actions by backpropagating along a policy trajectory has existed since the 1960 s [19, 7]. These methods were applied to settings where the state transition dynamics of the environment were known analytically and backward derivatives could be computed exactly, such as 5

6 Table 2: Performance for different regularizers (Navigation Task is measured in accuracy, Transport Task in sensitivity). Here γ represents standard deviation of added input/gradient noise or the regularizer coefficient. Navigation Transport γ Input Noise Gradient Noise Entropy Penalty Input Noise Gradient Noise Entropy Penalty Entropy Navigation Task No Regularization Entropy Penalty Input Noise Gradient noise Loss Navigation Task No Regularization Entropy Penalty Input Noise Gradient noise Entropy Transport Task No Regularization Entropy Penalty Input Noise Gradient noise Loss Transport Task No Regularization Entropy Penalty Input Noise Gradient noise Iterations (a) Iterations (b) Iterations Figure 1: Evolution of entropy and loss during inference. (c) Iterations (d) planning flight paths. Later works [30, 27, 17] explored the idea of backpropagating through learned, approximate forward models of the environment. Several recent works have explored predictive modeling in more general settings, such as complex physical systems [9, 5, 6, 22] and video [24, 18]. These works focused mostly on challenges associated with training an accurate forward model, by designing model architectures with shared parameters across different object instances or adversarial loss functions. Other works have used forward models for planning in continous action spaces, such as billiards [9], vehicle navigation [12] or robotics [33, 1, 32, 21]. There has been recent work in continuous relaxations of discrete random variables [23, 8], that also uses a softmax to form a continuous approximation to a discrete set. However, their work is focused on parameterizing categorical distributions over discrete variables in a manner which can be learned by gradient descent, whereas we are interested in the deterministic case. 5 Experiments In this section, we describe the results of applying the Forward Planner (FP), Q-learner and Imitation Learner on the two planning tasks. For the Navigation Task, the performance is measured in accuracy, i.e. the proportion of times the model generates an action sequence which takes it to the specified location. For the Transport task, the distribution is skewed towards sequences where few objects have moved. We therefore calculated the sensitivity (the proportion of objects which have been moved which are correctly classified) and the specificity (the proportion of objects that have not been moved which are correctly classified). Since we found that all models had high specificity (>0.95), we report sensitivity as our measure of performance on the Transport Task unless otherwise noted. Inference with the forward planner was done for 100 optimization updates using a learning rate of 1, unless otherwise noted. 5.1 Effect of Regularizers Our first experiment compared the three different methods to bias actions toward the one-hot vectors: input noise, gradient noise and entropy regularization. Results for several different hyperparameter values are shown in Table 2. For the Navigation tasks, all three methods yielded an improvement over the unregularized case (γ = 0). The input noise yielded the biggest improvement, with the other two being roughly equivalent. For the Transport task, the entropy regularization did not help at all. Input noise and gradient noise had similar best values, however the gradient noise was more 6

7 Table 3: Proportion of runs where planning reaches the desired state on the Navigation Task. Path distances greater than 5 were not seen during training. T max represents the maximum number of time steps in the sequence. (a) T max = 5 Path Distance Q-learner* Imitation Learner Forward Planner (b) T max = 10 Path Distance Q-learner* Imitation Learner Forward Planner Table 4: Sensitivity/Specificity on the Transport Task. (a) T max Q-learner 0.00 / / / 1.0 Imitation Learner 0.02 / / / 0.97 Forward Planner 0.23 / / / 0.99 robust to the hyperparameter γ. We then plotted the evolution of the network loss and entropy during minimization for each of the methods, using the values of γ which gave the best performance (Figure 1). When no regularization is used, the entropy decreases initially but then stays constant; all the other regularization methods cause the entropy to decrease further throughout the optimization procedure. Out of these, the entropy regularization causes it to decreases the most quickly and smoothly, which is expected since it is being explicitly minimized. Both the input noise and gradient noise cause the entropy to decrease, which confirms our theoretical prediction. Interestingly, the network loss quickly decreases for all of the methods and both tasks independently of the entropy, which indicates that there are many low loss minima on the simplex at varying distances from the one-hot vectors. Even though the entropy-regularized method produces solutions where actions are close to the one-hot vectors, the fact that it has relatively poor performance on the Transport task suggests the existence of configurations of actions close to the one-hot vectors which yield low loss yet still produce incorrect final states, reminiscent of adversarial examples. 5.2 Comparison to Baselines We now compare the Forward Planner to the other two models on both tasks. We chose to use input noise with γ = 1 based on the previous results, and evaluated the subsequent methods on data different from that used in the previous section. We used 100 optimization updates, which means that planning actions with the Forward Planner is more time consuming than with the other models (we later study the tradeoff between planning time and performance). Results comparing the various models on the Navigation Task are shown in Table 3. We report performance on all path distances mixed together according to the distribution used to generate the training set (denoted 1-5), as well as a breakdown of performance by path distance. Note that no sequences with path distance greater than 5 were seen during training; we wanted to see if the models could learn to compose what they had learned about planning smaller paths in order to plan longer ones. The Q-learner performs poorly for all path distances except for 0, indicating that it has learned a policy where it simply stays at the same location (note that paths of distance 0 are the most frequent in the dataset, see Appendix). The Imitation Learner performs well for shorter sequences (T max = 5), especially when planning shorter paths. However its performance drops for path distances beyond 5, 7

8 Accuracy Sensitivity Forward Planner Imitation Learner Q-learner 0.2 Forward Planner 0.1 Imitation Learner Q-learner Iterations (a) Navigation Task (T max = 10) Iterations (b) Transport Task (T max = 15) Figure 2: Performance vs. Planning Time indicating it does not generalize outside its training distribution. For longer sequences (T max = 10), it generalizes somewhat better but the overall performance is lower, possibly because early actions are less correlated with the final states, which makes learning difficult, or due to the compounding of errors when generating actions. In contrast, the Forward Planner performs better for longer sequences and is able to generalize fairly well to sequences longer than those seen during training, indicating it has been able to make use of the information contained in the longer sequences to learn general dynamics of the environment it can use for planning. Table 4 shows the performance of the different models on the Transport Task. Here the Q-learner is not able to learn to move objects as desired. The Imitation Learner is able to learn a sensible policy for the longer sequences, but not the shortest one. This is probably because the shorter sequence dataset has many sequences where objects to not move, and the model does not have enough signal to learn to move them. The Forward Planner is able to learn some information from the short sequence dataset despite the sparsity of its signal, and outperforms the other methods on the longer sequence datasets. The time required to plan a sequence of actions may be an important factor. Although the Forward Planner outperforms the other methods, it also takes 100 times longer to plan sequences, since it performs 100 optimization steps each requiring a forward and backward pass through the recurrent network. We therefore conducted an additional experiment measuring how performance changes with the number of optimization steps. In Figure 2 we plot the accuracy of the Forward Planner for different numbers of inference steps. We see that using 50 updates for the Forward Planner matches the accuracy of the other two methods on the Navigation Task, and accuracy steadily increases with the number of updates. For the Transport Task, the Forward Planner is able to match the performance of the Imitation Learner with very few updates. 6 Conclusion In this work, we introduced a simple yet effective method to perform planning in discrete action spaces using differentiable forward models of the environment. We addressed the issue of backpropagating into undefined regions by using a softmax paramaterization of the action vectors combined with added input noise, which we have shown theoretically and empirically to produce low-entropy solutions, i.e. solutions close to valid actions. Interestingly, this method performs better than explicitly imposing an entropy penalty term. This may be due to the additional term in the penalty induced by the added noise which penalizes the curvature of the loss surface, or because added noise may help escape local minima. Investigating this more thoroughly would be an interesting direction for future work. We also introduced two discrete planning tasks building on previous frameworks for question-answering. Although simple to describe, none of the methods we evaluated solved these tasks completely, hence they may serve for further development of planning algorithms. 8

9 References [1] Pieter Abbeel, Adam Coates, Morgan Quigley, and Andrew Y Ng. An application of reinforcement learning to aerobatic helicopter flight. Advances in neural information processing systems, 19:1, [2] Pieter Abbeel and Andrew Y. Ng. Apprenticeship learning via inverse reinforcement learning. In Proceedings of the Twenty-first International Conference on Machine Learning, ICML 04, pages 1, New York, NY, USA, ACM. [3] Guozhong An. The effects of adding noise during backpropagation training on a generalization performance. Neural Comput., 8(3): , April [4] J Andrew Bagnell and Stéphane Ross. Efficient Reductions for Imitation Learning. In Proceedings of the 13th International Conference on Artificial Intelligence and Statistics (AISTATS) 2010, volume 9, pages , [5] Peter W. Battaglia, Razvan Pascanu, Matthew Lai, Danilo Jimenez Rezende, and Koray Kavukcuoglu. Interaction networks for learning about objects, relations and physics. CoRR, abs/ , [6] Michael B. Chang, Tomer Ullman, Antonio Torralba, and Joshua B. Tenenbaum. A compositional object-based approach to learning physical dynamics. CoRR, abs/ , [7] Stuart Dreyfus. The numerical solution of variational problems. Journal of Mathematical Analysis and Applications, 5(1):30 45, [8] Shixiang Gu Eric Jang and Ben Poole. Categorical reparameterization with gumbel-softmax. CoRR, abs/ , [9] Katerina Fragkiadaki, Pulkit Agrawal, Sergey Levine, and Jitendra Malik. Learning visual predictive models of physics for playing billiards. CoRR, abs/ , [10] Jan Glascher, Nathaniel Daw, Peter Dayan, and John P. O Doherty. States versus rewards: Dissociable neural prediction error signals underlying model-based and model-free reinforcement learning. Neuron, 66(4): , [11] Shixiang Gu, Ethan Holly, Timothy P. Lillicrap, and Sergey Levine. Deep reinforcement learning for robotic manipulation. CoRR, abs/ , [12] Jessica Hamrick, Andrew Ballard, Razvan Pascanu, Oriol Vinyals, Nicolas Heess, and Peter Battaglia. Metacontrol for adaptive imagination-based optimization. ICLR 2017, [13] Mikael Henaff, Jason Weston, Arthur Szlam, Antoine Bordes, and Yann LeCun. Tracking the world state with recurrent entity networks. ICLR, abs/ , [14] Karl Moritz Hermann, Tomás Kociský, Edward Grefenstette, Lasse Espeholt, Will Kay, Mustafa Suleyman, and Phil Blunsom. Teaching machines to read and comprehend. In Advances in Neural Information Processing Systems 28: Annual Conference on Neural Information Processing Systems 2015, December 7-12, 2015, Montreal, Quebec, Canada, pages , [15] Felix Hill, Antoine Bordes, Sumit Chopra, and Jason Weston. The goldilocks principle: Reading children s books with explicit memory representations. In Proceedings of the International Conference on Learning Representations [16] Alvin C. Grissom Ii, He He, John Morgan, and Hal Daume III. Don t until the final verb wait: Reinforcement learning for simultaneous machine translation. In In Proceedings of EMNLP, [17] Michael I. Jordan and David E. Rumelhart. Forward models: Supervised learning with a distal teacher. Cognitive Science, 16(3): ,

10 [18] Nal Kalchbrenner, Aaron van den Oord, Karen Simonyan, Ivo Danihelka, Oriol Vinyals, Alex Graves, and Koray Kavukcuoglu. Video pixel networks. arxiv preprint arxiv: , [19] Henry Kelley. An on-line algorithm for dynamic reinforcement learning and planning in reactive environments. 30(10): , [20] Diederik P. Kingma and Jimmy Ba. Adam: A method for stochastic optimization. CoRR, abs/ , [21] Vikash Kumar, Emanuel Todorov, and Sergey Levine. Optimal control with learned local models: Application to dexterous manipulation. In Robotics and Automation (ICRA), 2016 IEEE International Conference on, pages IEEE, [22] Adam Lerer, Sam Gross, and Rob Fergus. Learning physical intuition of block towers by example. In Proceedings of the 33nd International Conference on Machine Learning, ICML 2016, New York City, NY, USA, June 19-24, 2016, pages , [23] Chris J. Maddison, Andriy Mnih, and Yee Whye Teh. The concrete distribution: A continuous relaxation of discrete random variables. CoRR, abs/ , [24] Michaël Mathieu, Camille Couprie, and Yann LeCun. Deep multi-scale video prediction beyond mean square error. CoRR, abs/ , [25] Tomas Mikolov and Geoffrey Zweig. Context dependent recurrent neural network language model. In 2012 IEEE Spoken Language Technology Workshop (SLT), Miami, FL, USA, December 2-5, 2012, pages , [26] Volodymyr Mnih, Koray Kavukcuoglu, David Silver, Andrei A. Rusu, Joel Veness, Marc G. Bellemare, Alex Graves, Martin Riedmiller, Andreas K. Fidjeland, Georg Ostrovski, Stig Petersen, Charles Beattie, Amir Sadik, Ioannis Antonoglou, Helen King, Dharshan Kumaran, Daan Wierstra, Shane Legg, and Demis Hassabis. Human-level control through deep reinforcement learning. Nature, 518(7540): , February [27] Derrick Nguyen and Bernard Widrow. Neural networks for control. chapter The Truck Backerupper: An Example of Self-learning in Neural Networks, pages MIT Press, Cambridge, MA, USA, [28] Junhyuk Oh, Valliappa Chockalingam, Satinder P. Singh, and Honglak Lee. Control of memory, active perception, and action in minecraft. CoRR, abs/ , [29] Dean A. Pomerleau. Efficient training of artificial neural networks for autonomous navigation. Neural Computation, 3:97, [30] Jurgen Schmidhuber. An on-line algorithm for dynamic reinforcement learning and planning in reactive environments. In Proceedings of the International Joint Conference on Neural Networks (IJCNN), [31] John Schulman, Sergey Levine, Pieter Abbeel, Michael I. Jordan, and Philipp Moritz. Trust region policy optimization. In Francis R. Bach and David M. Blei, editors, ICML, volume 37 of JMLR Workshop and Conference Proceedings, pages JMLR.org, [32] Emanuel Todorov, Tom Erez, and Yuval Tassa. Mujoco: A physics engine for model-based control. In Intelligent Robots and Systems (IROS), 2012 IEEE/RSJ International Conference on, pages IEEE, [33] Emanuel Todorov and Weiwei Li. A generalized iterative lqg method for locally-optimal feedback control of constrained nonlinear stochastic systems. In American Control Conference, Proceedings of the 2005, pages IEEE, [34] Jason Weston, Antoine Bordes, Sumit Chopra, and Tomas Mikolov. Towards ai-complete question answering: A set of prerequisite toy tasks. CoRR, abs/ ,

11 Appendix 6.1 Experimental Details All methods were trained using ADAM [20] using minibatches of size 32 for a maximum of two million updates. The forward models had an initial learning rate of 0.01, which we found to work well, and both Q-learners and Imitation Learners had their initial learning rate chosen by grid search over the set {0.01, 0.001, }. For both the forward models and the Imitation Learner, we divided the learning rate by 2 during training whenever the training loss did not decrease for more than 3 epochs, where one epoch was set to be 2000 updates. Due to the noisiness of the reward signal, we did not use this annealing strategy for the Q-learner and instead linearly decreased the learning rate by a factor of 10 over the course of a million updates. The target network of the Q-learner was updated after every 1000 batches of experience. Actions were chosen at training time using an ɛ-greedy policy, with ɛ starting at ɛ 0 = 0.5 and linearly interpolated to ɛ 0 /10 over the course of a million updates. The architecture of the forward model is given by: h t = φ(ua t + V h t 1 ) y t = σ(ch t ) y t is a distribution over final states, and the model is trained to minimize the cross-entropy between y t and the desired final state. The architecture of the Imitation Learner is given by the following equations: (3) h t = φ(ua t + V h t 1 + W s ) z t = φ(ah t + Bs ) y t = σ(cz t ) (4) Here σ represents a softmax function, y t is a distribution over possible actions, and the model is trained to minimize the cross-entropy between y t and a t+1 and thus to predict the next action conditioned on the current action sequence and goal state. 6.2 Dataset Statistics Figure 3 shows the original distribution of path distances for the Navigation Task dataset with T max = 10. The training set is restricted to sequences which have path distance at most 5. Figure 4 shows the distribution of the number of objects moved for the different Transport Task datasets. Note that the shorter sequence datasets are more skewed towards having action sequences which result in few objects being moved. (a) Distribution of path distances for T 10. Figure 3 11

12 (a) T max = 10 (b) T max = 15 (c) T max = 20 Figure 4: Number of objects moved for Transport Task 12

Deep Q-Learning with Recurrent Neural Networks

Deep Q-Learning with Recurrent Neural Networks Deep Q-Learning with Recurrent Neural Networks Clare Chen cchen9@stanford.edu Vincent Ying vincenthying@stanford.edu Dillon Laird dalaird@cs.stanford.edu Abstract Deep reinforcement learning models have

More information

arxiv: v1 [cs.ne] 4 Dec 2015

arxiv: v1 [cs.ne] 4 Dec 2015 Q-Networks for Binary Vector Actions arxiv:1512.01332v1 [cs.ne] 4 Dec 2015 Naoto Yoshida Tohoku University Aramaki Aza Aoba 6-6-01 Sendai 980-8579, Miyagi, Japan naotoyoshida@pfsl.mech.tohoku.ac.jp Abstract

More information

Based on the original slides of Hung-yi Lee

Based on the original slides of Hung-yi Lee Based on the original slides of Hung-yi Lee New Activation Function Rectified Linear Unit (ReLU) σ z a a = z Reason: 1. Fast to compute 2. Biological reason a = 0 [Xavier Glorot, AISTATS 11] [Andrew L.

More information

Deep reinforcement learning. Dialogue Systems Group, Cambridge University Engineering Department

Deep reinforcement learning. Dialogue Systems Group, Cambridge University Engineering Department Deep reinforcement learning Milica Gašić Dialogue Systems Group, Cambridge University Engineering Department 1 / 25 In this lecture... Introduction to deep reinforcement learning Value-based Deep RL Deep

More information

Supplementary Material: Control of Memory, Active Perception, and Action in Minecraft

Supplementary Material: Control of Memory, Active Perception, and Action in Minecraft Supplementary Material: Control of Memory, Active Perception, and Action in Minecraft Junhyuk Oh Valliappa Chockalingam Satinder Singh Honglak Lee Computer Science & Engineering, University of Michigan

More information

Reinforcement. Function Approximation. Learning with KATJA HOFMANN. Researcher, MSR Cambridge

Reinforcement. Function Approximation. Learning with KATJA HOFMANN. Researcher, MSR Cambridge Reinforcement Learning with Function Approximation KATJA HOFMANN Researcher, MSR Cambridge Representation and Generalization in RL Focus on training stability Learning generalizable value functions Navigating

More information

Multiagent (Deep) Reinforcement Learning

Multiagent (Deep) Reinforcement Learning Multiagent (Deep) Reinforcement Learning MARTIN PILÁT (MARTIN.PILAT@MFF.CUNI.CZ) Reinforcement learning The agent needs to learn to perform tasks in environment No prior knowledge about the effects of

More information

Supplementary Material for Asynchronous Methods for Deep Reinforcement Learning

Supplementary Material for Asynchronous Methods for Deep Reinforcement Learning Supplementary Material for Asynchronous Methods for Deep Reinforcement Learning May 2, 216 1 Optimization Details We investigated two different optimization algorithms with our asynchronous framework stochastic

More information

Q-Learning in Continuous State Action Spaces

Q-Learning in Continuous State Action Spaces Q-Learning in Continuous State Action Spaces Alex Irpan alexirpan@berkeley.edu December 5, 2015 Contents 1 Introduction 1 2 Background 1 3 Q-Learning 2 4 Q-Learning In Continuous Spaces 4 5 Experimental

More information

Deep Reinforcement Learning SISL. Jeremy Morton (jmorton2) November 7, Stanford Intelligent Systems Laboratory

Deep Reinforcement Learning SISL. Jeremy Morton (jmorton2) November 7, Stanford Intelligent Systems Laboratory Deep Reinforcement Learning Jeremy Morton (jmorton2) November 7, 2016 SISL Stanford Intelligent Systems Laboratory Overview 2 1 Motivation 2 Neural Networks 3 Deep Reinforcement Learning 4 Deep Learning

More information

VIME: Variational Information Maximizing Exploration

VIME: Variational Information Maximizing Exploration 1 VIME: Variational Information Maximizing Exploration Rein Houthooft, Xi Chen, Yan Duan, John Schulman, Filip De Turck, Pieter Abbeel Reviewed by Zhao Song January 13, 2017 1 Exploration in Reinforcement

More information

arxiv: v1 [cs.lg] 28 Dec 2016

arxiv: v1 [cs.lg] 28 Dec 2016 THE PREDICTRON: END-TO-END LEARNING AND PLANNING David Silver*, Hado van Hasselt*, Matteo Hessel*, Tom Schaul*, Arthur Guez*, Tim Harley, Gabriel Dulac-Arnold, David Reichert, Neil Rabinowitz, Andre Barreto,

More information

Experiments on the Consciousness Prior

Experiments on the Consciousness Prior Yoshua Bengio and William Fedus UNIVERSITÉ DE MONTRÉAL, MILA Abstract Experiments are proposed to explore a novel prior for representation learning, which can be combined with other priors in order to

More information

Deep Reinforcement Learning: Policy Gradients and Q-Learning

Deep Reinforcement Learning: Policy Gradients and Q-Learning Deep Reinforcement Learning: Policy Gradients and Q-Learning John Schulman Bay Area Deep Learning School September 24, 2016 Introduction and Overview Aim of This Talk What is deep RL, and should I use

More information

Reinforcement Learning. Odelia Schwartz 2016

Reinforcement Learning. Odelia Schwartz 2016 Reinforcement Learning Odelia Schwartz 2016 Forms of learning? Forms of learning Unsupervised learning Supervised learning Reinforcement learning Forms of learning Unsupervised learning Supervised learning

More information

CONCEPT LEARNING WITH ENERGY-BASED MODELS

CONCEPT LEARNING WITH ENERGY-BASED MODELS CONCEPT LEARNING WITH ENERGY-BASED MODELS Igor Mordatch OpenAI San Francisco mordatch@openai.com 1 OVERVIEW Many hallmarks of human intelligence, such as generalizing from limited experience, abstract

More information

arxiv: v1 [cs.lg] 5 Dec 2015

arxiv: v1 [cs.lg] 5 Dec 2015 Deep Attention Recurrent Q-Network Ivan Sorokin, Alexey Seleznev, Mikhail Pavlov, Aleksandr Fedorov, Anastasiia Ignateva 5vision 5visionteam@gmail.com arxiv:1512.1693v1 [cs.lg] 5 Dec 215 Abstract A deep

More information

The Predictron: End-To-End Learning and Planning

The Predictron: End-To-End Learning and Planning David Silver * 1 Hado van Hasselt * 1 Matteo Hessel * 1 Tom Schaul * 1 Arthur Guez * 1 Tim Harley 1 Gabriel Dulac-Arnold 1 David Reichert 1 Neil Rabinowitz 1 Andre Barreto 1 Thomas Degris 1 Abstract One

More information

Notes on Tabular Methods

Notes on Tabular Methods Notes on Tabular ethods Nan Jiang September 28, 208 Overview of the methods. Tabular certainty-equivalence Certainty-equivalence is a model-based RL algorithm, that is, it first estimates an DP model from

More information

arxiv: v3 [cs.lg] 12 Jan 2016

arxiv: v3 [cs.lg] 12 Jan 2016 REINFORCEMENT LEARNING NEURAL TURING MACHINES - REVISED Wojciech Zaremba 1,2 New York University Facebook AI Research woj.zaremba@gmail.com Ilya Sutskever 2 Google Brain ilyasu@google.com ABSTRACT arxiv:1505.00521v3

More information

Weighted Double Q-learning

Weighted Double Q-learning Proceedings of the Twenty-Sixth International Joint Conference on Artificial Intelligence (IJCAI-7) Weighted Double Q-learning Zongzhang Zhang Soochow University Suzhou, Jiangsu 56 China zzzhang@suda.edu.cn

More information

THE PREDICTRON: END-TO-END LEARNING AND PLANNING

THE PREDICTRON: END-TO-END LEARNING AND PLANNING THE PREDICTRON: END-TO-END LEARNING AND PLANNING David Silver*, Hado van Hasselt*, Matteo Hessel*, Tom Schaul*, Arthur Guez*, Tim Harley, Gabriel Dulac-Arnold, David Reichert, Neil Rabinowitz, Andre Barreto,

More information

Tracking the World State with Recurrent Entity Networks

Tracking the World State with Recurrent Entity Networks Tracking the World State with Recurrent Entity Networks Mikael Henaff, Jason Weston, Arthur Szlam, Antoine Bordes, Yann LeCun Task At each timestep, get information (in the form of a sentence) about the

More information

arxiv: v3 [cs.lg] 4 Jun 2016 Abstract

arxiv: v3 [cs.lg] 4 Jun 2016 Abstract Weight Normalization: A Simple Reparameterization to Accelerate Training of Deep Neural Networks Tim Salimans OpenAI tim@openai.com Diederik P. Kingma OpenAI dpkingma@openai.com arxiv:1602.07868v3 [cs.lg]

More information

Driving a car with low dimensional input features

Driving a car with low dimensional input features December 2016 Driving a car with low dimensional input features Jean-Claude Manoli jcma@stanford.edu Abstract The aim of this project is to train a Deep Q-Network to drive a car on a race track using only

More information

Combining PPO and Evolutionary Strategies for Better Policy Search

Combining PPO and Evolutionary Strategies for Better Policy Search Combining PPO and Evolutionary Strategies for Better Policy Search Jennifer She 1 Abstract A good policy search algorithm needs to strike a balance between being able to explore candidate policies and

More information

Reinforcement Learning

Reinforcement Learning Reinforcement Learning From the basics to Deep RL Olivier Sigaud ISIR, UPMC + INRIA http://people.isir.upmc.fr/sigaud September 14, 2017 1 / 54 Introduction Outline Some quick background about discrete

More information

Variance Reduction for Policy Gradient Methods. March 13, 2017

Variance Reduction for Policy Gradient Methods. March 13, 2017 Variance Reduction for Policy Gradient Methods March 13, 2017 Reward Shaping Reward Shaping Reward Shaping Reward shaping: r(s, a, s ) = r(s, a, s ) + γφ(s ) Φ(s) for arbitrary potential Φ Theorem: r admits

More information

INF 5860 Machine learning for image classification. Lecture 14: Reinforcement learning May 9, 2018

INF 5860 Machine learning for image classification. Lecture 14: Reinforcement learning May 9, 2018 Machine learning for image classification Lecture 14: Reinforcement learning May 9, 2018 Page 3 Outline Motivation Introduction to reinforcement learning (RL) Value function based methods (Q-learning)

More information

IMPROVING STOCHASTIC GRADIENT DESCENT

IMPROVING STOCHASTIC GRADIENT DESCENT IMPROVING STOCHASTIC GRADIENT DESCENT WITH FEEDBACK Jayanth Koushik & Hiroaki Hayashi Language Technologies Institute Carnegie Mellon University Pittsburgh, PA 15213, USA {jkoushik,hiroakih}@cs.cmu.edu

More information

Machine Learning for Large-Scale Data Analysis and Decision Making A. Neural Networks Week #6

Machine Learning for Large-Scale Data Analysis and Decision Making A. Neural Networks Week #6 Machine Learning for Large-Scale Data Analysis and Decision Making 80-629-17A Neural Networks Week #6 Today Neural Networks A. Modeling B. Fitting C. Deep neural networks Today s material is (adapted)

More information

arxiv: v1 [cs.lg] 28 Dec 2017

arxiv: v1 [cs.lg] 28 Dec 2017 PixelSNAIL: An Improved Autoregressive Generative Model arxiv:1712.09763v1 [cs.lg] 28 Dec 2017 Xi Chen, Nikhil Mishra, Mostafa Rohaninejad, Pieter Abbeel Embodied Intelligence UC Berkeley, Department of

More information

Reinforcement Learning and NLP

Reinforcement Learning and NLP 1 Reinforcement Learning and NLP Kapil Thadani kapil@cs.columbia.edu RESEARCH Outline 2 Model-free RL Markov decision processes (MDPs) Derivative-free optimization Policy gradients Variance reduction Value

More information

Reinforcement Learning

Reinforcement Learning Reinforcement Learning Inverse Reinforcement Learning Inverse RL, behaviour cloning, apprenticeship learning, imitation learning. Vien Ngo Marc Toussaint University of Stuttgart Outline Introduction to

More information

Learning Tetris. 1 Tetris. February 3, 2009

Learning Tetris. 1 Tetris. February 3, 2009 Learning Tetris Matt Zucker Andrew Maas February 3, 2009 1 Tetris The Tetris game has been used as a benchmark for Machine Learning tasks because its large state space (over 2 200 cell configurations are

More information

Local Affine Approximators for Improving Knowledge Transfer

Local Affine Approximators for Improving Knowledge Transfer Local Affine Approximators for Improving Knowledge Transfer Suraj Srinivas & François Fleuret Idiap Research Institute and EPFL {suraj.srinivas, francois.fleuret}@idiap.ch Abstract The Jacobian of a neural

More information

CS230: Lecture 9 Deep Reinforcement Learning

CS230: Lecture 9 Deep Reinforcement Learning CS230: Lecture 9 Deep Reinforcement Learning Kian Katanforoosh Menti code: 21 90 15 Today s outline I. Motivation II. Recycling is good: an introduction to RL III. Deep Q-Learning IV. Application of Deep

More information

Approximation Methods in Reinforcement Learning

Approximation Methods in Reinforcement Learning 2018 CS420, Machine Learning, Lecture 12 Approximation Methods in Reinforcement Learning Weinan Zhang Shanghai Jiao Tong University http://wnzhang.net http://wnzhang.net/teaching/cs420/index.html Reinforcement

More information

Deep Reinforcement Learning. Scratching the surface

Deep Reinforcement Learning. Scratching the surface Deep Reinforcement Learning Scratching the surface Deep Reinforcement Learning Scenario of Reinforcement Learning Observation State Agent Action Change the environment Don t do that Reward Environment

More information

Learning Dexterity Matthias Plappert SEPTEMBER 6, 2018

Learning Dexterity Matthias Plappert SEPTEMBER 6, 2018 Learning Dexterity Matthias Plappert SEPTEMBER 6, 2018 OpenAI OpenAI is a non-profit AI research company, discovering and enacting the path to safe artificial general intelligence. OpenAI OpenAI is a non-profit

More information

REINFORCE Framework for Stochastic Policy Optimization and its use in Deep Learning

REINFORCE Framework for Stochastic Policy Optimization and its use in Deep Learning REINFORCE Framework for Stochastic Policy Optimization and its use in Deep Learning Ronen Tamari The Hebrew University of Jerusalem Advanced Seminar in Deep Learning (#67679) February 28, 2016 Ronen Tamari

More information

COMPARING FIXED AND ADAPTIVE COMPUTATION TIME FOR RE-

COMPARING FIXED AND ADAPTIVE COMPUTATION TIME FOR RE- Workshop track - ICLR COMPARING FIXED AND ADAPTIVE COMPUTATION TIME FOR RE- CURRENT NEURAL NETWORKS Daniel Fojo, Víctor Campos, Xavier Giró-i-Nieto Universitat Politècnica de Catalunya, Barcelona Supercomputing

More information

Need for Deep Networks Perceptron. Can only model linear functions. Kernel Machines. Non-linearity provided by kernels

Need for Deep Networks Perceptron. Can only model linear functions. Kernel Machines. Non-linearity provided by kernels Need for Deep Networks Perceptron Can only model linear functions Kernel Machines Non-linearity provided by kernels Need to design appropriate kernels (possibly selecting from a set, i.e. kernel learning)

More information

Stochastic Video Prediction with Deep Conditional Generative Models

Stochastic Video Prediction with Deep Conditional Generative Models Stochastic Video Prediction with Deep Conditional Generative Models Rui Shu Stanford University ruishu@stanford.edu Abstract Frame-to-frame stochasticity remains a big challenge for video prediction. The

More information

Introduction to Convolutional Neural Networks (CNNs)

Introduction to Convolutional Neural Networks (CNNs) Introduction to Convolutional Neural Networks (CNNs) nojunk@snu.ac.kr http://mipal.snu.ac.kr Department of Transdisciplinary Studies Seoul National University, Korea Jan. 2016 Many slides are from Fei-Fei

More information

Approximate Q-Learning. Dan Weld / University of Washington

Approximate Q-Learning. Dan Weld / University of Washington Approximate Q-Learning Dan Weld / University of Washington [Many slides taken from Dan Klein and Pieter Abbeel / CS188 Intro to AI at UC Berkeley materials available at http://ai.berkeley.edu.] Q Learning

More information

Reinforcement Learning for NLP

Reinforcement Learning for NLP Reinforcement Learning for NLP Advanced Machine Learning for NLP Jordan Boyd-Graber REINFORCEMENT OVERVIEW, POLICY GRADIENT Adapted from slides by David Silver, Pieter Abbeel, and John Schulman Advanced

More information

Introduction to Reinforcement Learning

Introduction to Reinforcement Learning CSCI-699: Advanced Topics in Deep Learning 01/16/2019 Nitin Kamra Spring 2019 Introduction to Reinforcement Learning 1 What is Reinforcement Learning? So far we have seen unsupervised and supervised learning.

More information

REBAR: Low-variance, unbiased gradient estimates for discrete latent variable models

REBAR: Low-variance, unbiased gradient estimates for discrete latent variable models RBAR: Low-variance, unbiased gradient estimates for discrete latent variable models George Tucker 1,, Andriy Mnih 2, Chris J. Maddison 2,3, Dieterich Lawson 1,*, Jascha Sohl-Dickstein 1 1 Google Brain,

More information

The Supervised Learning Approach To Estimating Heterogeneous Causal Regime Effects

The Supervised Learning Approach To Estimating Heterogeneous Causal Regime Effects The Supervised Learning Approach To Estimating Heterogeneous Causal Regime Effects Thai T. Pham Stanford Graduate School of Business thaipham@stanford.edu May, 2016 Introduction Observations Many sequential

More information

The Predictron: End-To-End Learning and Planning

The Predictron: End-To-End Learning and Planning David Silver * 1 Hado van Hasselt * 1 Matteo Hessel * 1 Tom Schaul * 1 Arthur Guez * 1 Tim Harley 1 Gabriel Dulac-Arnold 1 David Reichert 1 Neil Rabinowitz 1 Andre Barreto 1 Thomas Degris 1 We applied

More information

Deep Reinforcement Learning via Policy Optimization

Deep Reinforcement Learning via Policy Optimization Deep Reinforcement Learning via Policy Optimization John Schulman July 3, 2017 Introduction Deep Reinforcement Learning: What to Learn? Policies (select next action) Deep Reinforcement Learning: What to

More information

CS 229 Project Final Report: Reinforcement Learning for Neural Network Architecture Category : Theory & Reinforcement Learning

CS 229 Project Final Report: Reinforcement Learning for Neural Network Architecture Category : Theory & Reinforcement Learning CS 229 Project Final Report: Reinforcement Learning for Neural Network Architecture Category : Theory & Reinforcement Learning Lei Lei Ruoxuan Xiong December 16, 2017 1 Introduction Deep Neural Network

More information

Policy Gradient Methods. February 13, 2017

Policy Gradient Methods. February 13, 2017 Policy Gradient Methods February 13, 2017 Policy Optimization Problems maximize E π [expression] π Fixed-horizon episodic: T 1 Average-cost: lim T 1 T r t T 1 r t Infinite-horizon discounted: γt r t Variable-length

More information

An Off-policy Policy Gradient Theorem Using Emphatic Weightings

An Off-policy Policy Gradient Theorem Using Emphatic Weightings An Off-policy Policy Gradient Theorem Using Emphatic Weightings Ehsan Imani, Eric Graves, Martha White Reinforcement Learning and Artificial Intelligence Laboratory Department of Computing Science University

More information

Gradient Estimation Using Stochastic Computation Graphs

Gradient Estimation Using Stochastic Computation Graphs Gradient Estimation Using Stochastic Computation Graphs Yoonho Lee Department of Computer Science and Engineering Pohang University of Science and Technology May 9, 2017 Outline Stochastic Gradient Estimation

More information

Eve: A Gradient Based Optimization Method with Locally and Globally Adaptive Learning Rates

Eve: A Gradient Based Optimization Method with Locally and Globally Adaptive Learning Rates Eve: A Gradient Based Optimization Method with Locally and Globally Adaptive Learning Rates Hiroaki Hayashi 1,* Jayanth Koushik 1,* Graham Neubig 1 arxiv:1611.01505v3 [cs.lg] 11 Jun 2018 Abstract Adaptive

More information

Lecture 9: Policy Gradient II 1

Lecture 9: Policy Gradient II 1 Lecture 9: Policy Gradient II 1 Emma Brunskill CS234 Reinforcement Learning. Winter 2019 Additional reading: Sutton and Barto 2018 Chp. 13 1 With many slides from or derived from David Silver and John

More information

I D I A P. Online Policy Adaptation for Ensemble Classifiers R E S E A R C H R E P O R T. Samy Bengio b. Christos Dimitrakakis a IDIAP RR 03-69

I D I A P. Online Policy Adaptation for Ensemble Classifiers R E S E A R C H R E P O R T. Samy Bengio b. Christos Dimitrakakis a IDIAP RR 03-69 R E S E A R C H R E P O R T Online Policy Adaptation for Ensemble Classifiers Christos Dimitrakakis a IDIAP RR 03-69 Samy Bengio b I D I A P December 2003 D a l l e M o l l e I n s t i t u t e for Perceptual

More information

Trust Region Policy Optimization

Trust Region Policy Optimization Trust Region Policy Optimization Yixin Lin Duke University yixin.lin@duke.edu March 28, 2017 Yixin Lin (Duke) TRPO March 28, 2017 1 / 21 Overview 1 Preliminaries Markov Decision Processes Policy iteration

More information

Comparing Deep Reinforcement Learning and Evolutionary Methods in Continuous Control

Comparing Deep Reinforcement Learning and Evolutionary Methods in Continuous Control Comparing Deep Reinforcement Learning and Evolutionary Methods in Continuous Control Shangtong Zhang Dept. of Computing Science University of Alberta shangtong.zhang@ualberta.ca Osmar R. Zaiane Dept. of

More information

arxiv: v2 [cs.lg] 2 Oct 2018

arxiv: v2 [cs.lg] 2 Oct 2018 AMBER: Adaptive Multi-Batch Experience Replay for Continuous Action Control arxiv:171.4423v2 [cs.lg] 2 Oct 218 Seungyul Han Dept. of Electrical Engineering KAIST Daejeon, South Korea 34141 sy.han@kaist.ac.kr

More information

CS599 Lecture 1 Introduction To RL

CS599 Lecture 1 Introduction To RL CS599 Lecture 1 Introduction To RL Reinforcement Learning Introduction Learning from rewards Policies Value Functions Rewards Models of the Environment Exploitation vs. Exploration Dynamic Programming

More information

Soft Actor-Critic: Off-Policy Maximum Entropy Deep Reinforcement Learning with a Stochastic Actor

Soft Actor-Critic: Off-Policy Maximum Entropy Deep Reinforcement Learning with a Stochastic Actor Soft Actor-Critic: Off-Policy Maximum Entropy Deep Reinforcement Learning with a Stochastic Actor Tuomas Haarnoja, Aurick Zhou, Pieter Abbeel & Sergey Levine Department of Electrical Engineering and Computer

More information

Day 3 Lecture 3. Optimizing deep networks

Day 3 Lecture 3. Optimizing deep networks Day 3 Lecture 3 Optimizing deep networks Convex optimization A function is convex if for all α [0,1]: f(x) Tangent line Examples Quadratics 2-norms Properties Local minimum is global minimum x Gradient

More information

Behavior Policy Gradient Supplemental Material

Behavior Policy Gradient Supplemental Material Behavior Policy Gradient Supplemental Material Josiah P. Hanna 1 Philip S. Thomas 2 3 Peter Stone 1 Scott Niekum 1 A. Proof of Theorem 1 In Appendix A, we give the full derivation of our primary theoretical

More information

Lecture 5: Logistic Regression. Neural Networks

Lecture 5: Logistic Regression. Neural Networks Lecture 5: Logistic Regression. Neural Networks Logistic regression Comparison with generative models Feed-forward neural networks Backpropagation Tricks for training neural networks COMP-652, Lecture

More information

Need for Deep Networks Perceptron. Can only model linear functions. Kernel Machines. Non-linearity provided by kernels

Need for Deep Networks Perceptron. Can only model linear functions. Kernel Machines. Non-linearity provided by kernels Need for Deep Networks Perceptron Can only model linear functions Kernel Machines Non-linearity provided by kernels Need to design appropriate kernels (possibly selecting from a set, i.e. kernel learning)

More information

arxiv: v1 [stat.ml] 3 Nov 2016

arxiv: v1 [stat.ml] 3 Nov 2016 CATEGORICAL REPARAMETERIZATION WITH GUMBEL-SOFTMAX Eric Jang Google Brain ejang@google.com Shixiang Gu University of Cambridge MPI Tübingen Google Brain sg717@cam.ac.uk Ben Poole Stanford University poole@cs.stanford.edu

More information

Reinforcement Learning for Continuous. Action using Stochastic Gradient Ascent. Hajime KIMURA, Shigenobu KOBAYASHI JAPAN

Reinforcement Learning for Continuous. Action using Stochastic Gradient Ascent. Hajime KIMURA, Shigenobu KOBAYASHI JAPAN Reinforcement Learning for Continuous Action using Stochastic Gradient Ascent Hajime KIMURA, Shigenobu KOBAYASHI Tokyo Institute of Technology, 4259 Nagatsuda, Midori-ku Yokohama 226-852 JAPAN Abstract:

More information

Multi-Batch Experience Replay for Fast Convergence of Continuous Action Control

Multi-Batch Experience Replay for Fast Convergence of Continuous Action Control 1 Multi-Batch Experience Replay for Fast Convergence of Continuous Action Control Seungyul Han and Youngchul Sung Abstract arxiv:1710.04423v1 [cs.lg] 12 Oct 2017 Policy gradient methods for direct policy

More information

Learning Continuous Control Policies by Stochastic Value Gradients

Learning Continuous Control Policies by Stochastic Value Gradients Learning Continuous Control Policies by Stochastic Value Gradients Nicolas Heess, Greg Wayne, David Silver, Timothy Lillicrap, Yuval Tassa, Tom Erez Google DeepMind {heess, gregwayne, davidsilver, countzero,

More information

Comparison of Modern Stochastic Optimization Algorithms

Comparison of Modern Stochastic Optimization Algorithms Comparison of Modern Stochastic Optimization Algorithms George Papamakarios December 214 Abstract Gradient-based optimization methods are popular in machine learning applications. In large-scale problems,

More information

Variational Deep Q Network

Variational Deep Q Network Variational Deep Q Network Yunhao Tang Department of IEOR Columbia University yt2541@columbia.edu Alp Kucukelbir Department of Computer Science Columbia University alp@cs.columbia.edu Abstract We propose

More information

Introduction to Reinforcement Learning. CMPT 882 Mar. 18

Introduction to Reinforcement Learning. CMPT 882 Mar. 18 Introduction to Reinforcement Learning CMPT 882 Mar. 18 Outline for the week Basic ideas in RL Value functions and value iteration Policy evaluation and policy improvement Model-free RL Monte-Carlo and

More information

Optimal Control, Trajectory Optimization, Learning Dynamics

Optimal Control, Trajectory Optimization, Learning Dynamics Carnegie Mellon School of Computer Science Deep Reinforcement Learning and Control Optimal Control, Trajectory Optimization, Learning Dynamics Katerina Fragkiadaki So far.. Most Reinforcement Learning

More information

FAST ADAPTATION IN GENERATIVE MODELS WITH GENERATIVE MATCHING NETWORKS

FAST ADAPTATION IN GENERATIVE MODELS WITH GENERATIVE MATCHING NETWORKS FAST ADAPTATION IN GENERATIVE MODELS WITH GENERATIVE MATCHING NETWORKS Sergey Bartunov & Dmitry P. Vetrov National Research University Higher School of Economics (HSE), Yandex Moscow, Russia ABSTRACT We

More information

Denoising Autoencoders

Denoising Autoencoders Denoising Autoencoders Oliver Worm, Daniel Leinfelder 20.11.2013 Oliver Worm, Daniel Leinfelder Denoising Autoencoders 20.11.2013 1 / 11 Introduction Poor initialisation can lead to local minima 1986 -

More information

arxiv: v1 [cs.lg] 11 May 2015

arxiv: v1 [cs.lg] 11 May 2015 Improving neural networks with bunches of neurons modeled by Kumaraswamy units: Preliminary study Jakub M. Tomczak JAKUB.TOMCZAK@PWR.EDU.PL Wrocław University of Technology, wybrzeże Wyspiańskiego 7, 5-37,

More information

CSC321 Lecture 8: Optimization

CSC321 Lecture 8: Optimization CSC321 Lecture 8: Optimization Roger Grosse Roger Grosse CSC321 Lecture 8: Optimization 1 / 26 Overview We ve talked a lot about how to compute gradients. What do we actually do with them? Today s lecture:

More information

MDP Preliminaries. Nan Jiang. February 10, 2019

MDP Preliminaries. Nan Jiang. February 10, 2019 MDP Preliminaries Nan Jiang February 10, 2019 1 Markov Decision Processes In reinforcement learning, the interactions between the agent and the environment are often described by a Markov Decision Process

More information

Black-box α-divergence Minimization

Black-box α-divergence Minimization Black-box α-divergence Minimization José Miguel Hernández-Lobato, Yingzhen Li, Daniel Hernández-Lobato, Thang Bui, Richard Turner, Harvard University, University of Cambridge, Universidad Autónoma de Madrid.

More information

Q-PROP: SAMPLE-EFFICIENT POLICY GRADIENT

Q-PROP: SAMPLE-EFFICIENT POLICY GRADIENT Q-PROP: SAMPLE-EFFICIENT POLICY GRADIENT WITH AN OFF-POLICY CRITIC Shixiang Gu 123, Timothy Lillicrap 4, Zoubin Ghahramani 16, Richard E. Turner 1, Sergey Levine 35 sg717@cam.ac.uk,countzero@google.com,zoubin@eng.cam.ac.uk,

More information

From perceptrons to word embeddings. Simon Šuster University of Groningen

From perceptrons to word embeddings. Simon Šuster University of Groningen From perceptrons to word embeddings Simon Šuster University of Groningen Outline A basic computational unit Weighting some input to produce an output: classification Perceptron Classify tweets Written

More information

arxiv: v1 [cs.lg] 1 Jun 2017

arxiv: v1 [cs.lg] 1 Jun 2017 Interpolated Policy Gradient: Merging On-Policy and Off-Policy Gradient Estimation for Deep Reinforcement Learning arxiv:706.00387v [cs.lg] Jun 207 Shixiang Gu University of Cambridge Max Planck Institute

More information

Lecture 9: Policy Gradient II (Post lecture) 2

Lecture 9: Policy Gradient II (Post lecture) 2 Lecture 9: Policy Gradient II (Post lecture) 2 Emma Brunskill CS234 Reinforcement Learning. Winter 2018 Additional reading: Sutton and Barto 2018 Chp. 13 2 With many slides from or derived from David Silver

More information

Advanced Policy Gradient Methods: Natural Gradient, TRPO, and More. March 8, 2017

Advanced Policy Gradient Methods: Natural Gradient, TRPO, and More. March 8, 2017 Advanced Policy Gradient Methods: Natural Gradient, TRPO, and More March 8, 2017 Defining a Loss Function for RL Let η(π) denote the expected return of π [ ] η(π) = E s0 ρ 0,a t π( s t) γ t r t We collect

More information

WHY ARE DEEP NETS REVERSIBLE: A SIMPLE THEORY,

WHY ARE DEEP NETS REVERSIBLE: A SIMPLE THEORY, WHY ARE DEEP NETS REVERSIBLE: A SIMPLE THEORY, WITH IMPLICATIONS FOR TRAINING Sanjeev Arora, Yingyu Liang & Tengyu Ma Department of Computer Science Princeton University Princeton, NJ 08540, USA {arora,yingyul,tengyu}@cs.princeton.edu

More information

Learning Long Term Dependencies with Gradient Descent is Difficult

Learning Long Term Dependencies with Gradient Descent is Difficult Learning Long Term Dependencies with Gradient Descent is Difficult IEEE Trans. on Neural Networks 1994 Yoshua Bengio, Patrice Simard, Paolo Frasconi Presented by: Matt Grimes, Ayse Naz Erkan Recurrent

More information

David Silver, Google DeepMind

David Silver, Google DeepMind Tutorial: Deep Reinforcement Learning David Silver, Google DeepMind Outline Introduction to Deep Learning Introduction to Reinforcement Learning Value-Based Deep RL Policy-Based Deep RL Model-Based Deep

More information

Introduction to Deep Neural Networks

Introduction to Deep Neural Networks Introduction to Deep Neural Networks Presenter: Chunyuan Li Pattern Classification and Recognition (ECE 681.01) Duke University April, 2016 Outline 1 Background and Preliminaries Why DNNs? Model: Logistic

More information

Relative Entropy Inverse Reinforcement Learning

Relative Entropy Inverse Reinforcement Learning Relative Entropy Inverse Reinforcement Learning Abdeslam Boularias Jens Kober Jan Peters Max-Planck Institute for Intelligent Systems 72076 Tübingen, Germany {abdeslam.boularias,jens.kober,jan.peters}@tuebingen.mpg.de

More information

Forward Actor-Critic for Nonlinear Function Approximation in Reinforcement Learning

Forward Actor-Critic for Nonlinear Function Approximation in Reinforcement Learning Forward Actor-Critic for Nonlinear Function Approximation in Reinforcement Learning Vivek Veeriah Dept. of Computing Science University of Alberta Edmonton, Canada vivekveeriah@ualberta.ca Harm van Seijen

More information

Deep Gate Recurrent Neural Network

Deep Gate Recurrent Neural Network JMLR: Workshop and Conference Proceedings 63:350 365, 2016 ACML 2016 Deep Gate Recurrent Neural Network Yuan Gao University of Helsinki Dorota Glowacka University of Helsinki gaoyuankidult@gmail.com glowacka@cs.helsinki.fi

More information

Machine Learning for Computer Vision 8. Neural Networks and Deep Learning. Vladimir Golkov Technical University of Munich Computer Vision Group

Machine Learning for Computer Vision 8. Neural Networks and Deep Learning. Vladimir Golkov Technical University of Munich Computer Vision Group Machine Learning for Computer Vision 8. Neural Networks and Deep Learning Vladimir Golkov Technical University of Munich Computer Vision Group INTRODUCTION Nonlinear Coordinate Transformation http://cs.stanford.edu/people/karpathy/convnetjs/

More information

Learning Gaussian Process Models from Uncertain Data

Learning Gaussian Process Models from Uncertain Data Learning Gaussian Process Models from Uncertain Data Patrick Dallaire, Camille Besse, and Brahim Chaib-draa DAMAS Laboratory, Computer Science & Software Engineering Department, Laval University, Canada

More information

arxiv: v2 [cs.cl] 1 Jan 2019

arxiv: v2 [cs.cl] 1 Jan 2019 Variational Self-attention Model for Sentence Representation arxiv:1812.11559v2 [cs.cl] 1 Jan 2019 Qiang Zhang 1, Shangsong Liang 2, Emine Yilmaz 1 1 University College London, London, United Kingdom 2

More information

arxiv: v1 [cs.lg] 4 May 2015

arxiv: v1 [cs.lg] 4 May 2015 Reinforcement Learning Neural Turing Machines arxiv:1505.00521v1 [cs.lg] 4 May 2015 Wojciech Zaremba 1,2 NYU woj.zaremba@gmail.com Abstract Ilya Sutskever 2 Google ilyasu@google.com The expressive power

More information

Deep Feedforward Networks

Deep Feedforward Networks Deep Feedforward Networks Liu Yang March 30, 2017 Liu Yang Short title March 30, 2017 1 / 24 Overview 1 Background A general introduction Example 2 Gradient based learning Cost functions Output Units 3

More information

Value-Aware Loss Function for Model Learning in Reinforcement Learning

Value-Aware Loss Function for Model Learning in Reinforcement Learning Journal of Machine Learning Research (EWRL Series) (2016) Submitted 9/16; Published 12/16 Value-Aware Loss Function for Model Learning in Reinforcement Learning Amir-massoud Farahmand farahmand@merl.com

More information