Neural Map. Structured Memory for Deep RL. Emilio Parisotto


Neural Map: Structured Memory for Deep RL. Emilio Parisotto (eparisot@andrew.cmu.edu), PhD Student, Machine Learning Department, Carnegie Mellon University.

Supervised Learning. Most deep learning problems are posed as supervised learning problems: a deep neural net is trained to map from an observation to an action, e.g. describing what is in an image. The environment is typically static: it does not change over time. Actions are assumed to be independent of one another: e.g. labelling one image does not affect the next one.

Environments are not always well-behaved. Environments are dynamic and change over time: an autonomous agent has to handle new environments. Actions can affect the environment with arbitrary time lags: buying a stock can lose all your money years in the future. Labels can be expensive or difficult to obtain: e.g. the optimal actuations of a swimming octopus robot.

Reinforcement Learning: Closing the Loop. Instead of a label, the agent is provided with a reward signal: high reward means good behaviour. Reinforcement learning produces policies: behaviours that map observations to actions and maximize long-term reward. This allows learning purposeful behaviours in dynamic environments.

Deep Reinforcement Learning. Use a deep neural network to parameterize the policy, and adapt its parameters to maximize reward using, e.g., Q-learning (Mnih et al., 2013) or Actor-Critic (Mnih et al., 2016). How well does it work?
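
To make the Q-learning option concrete, here is a minimal, hypothetical sketch of one deep Q-learning update step in PyTorch; the network sizes and state/action dimensions are placeholders, not the setup used in the talk:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Hypothetical Q-network over a 4-dim state and 2 discrete actions.
q_net = nn.Sequential(nn.Linear(4, 64), nn.ReLU(), nn.Linear(64, 2))
optimizer = torch.optim.Adam(q_net.parameters(), lr=1e-3)
gamma = 0.99

def q_learning_step(s, a, r, s_next, done):
    """One TD update: move Q(s, a) toward r + gamma * max_a' Q(s', a').

    s, s_next: (batch, 4) float; a: (batch,) long; r, done: (batch,) float.
    """
    with torch.no_grad():
        target = r + gamma * (1.0 - done) * q_net(s_next).max(dim=-1).values
    q_sa = q_net(s).gather(-1, a.unsqueeze(-1)).squeeze(-1)  # Q(s, a) for taken actions
    loss = F.mse_loss(q_sa, target)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```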

Deep Reinforcement Learning Chaplot, Lample, AAAI 2017

Current Memory for RL Agents. Deep RL does extremely well on reactive tasks, but typically has an effective memory horizon of less than 1 second. Almost all interesting problems are partially observable: 3D games (with long-term objectives), self-driving cars (partial occlusion). Memory structures will be crucial to scale up to partially observable tasks.

External Memory? Current memory structures are usually simple: add an LSTM layer to the network. Can we learn an agent with a more advanced external memory, such as Neural Turing Machines (Graves et al., 2014) or Differentiable Neural Computers (Graves et al., 2016)? Challenge: these memory systems are difficult to train, especially using RL.

Why Memory is Challenging: Write Operations. Suppose an agent is in a simple maze: the agent starts at the top of the map and is shown a color near its initial state. This color determines which goal is the correct one.

Why Writing to Memory is Challenging. At the start, the agent has no a priori knowledge that it should store the color in memory. For the episode to be informative, all of the following must hold: 1. the color is written to memory at the start of the maze, 2. that memory is never overwritten over the T time steps of the episode, and 3. the agent finds and enters the goal. If any condition fails, the episode is useless and provides little new information to the agent. Solution: write everything into memory!

Memory Network. A class of structures that were recently shown to learn difficult maze-based memory tasks (Oh et al., 2016). These systems simply store (key, value) representations of the M most recent frames. At each time step, they: 1. perform a read operation over their memory database, and 2. write the latest percept into memory.
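
As a rough sketch (not Oh et al.'s exact formulation), the read can be viewed as soft attention over the stored (key, value) pairs of the last M frames, and the write as appending the newest percept and dropping the oldest:

```python
import torch
import torch.nn.functional as F

def memory_read(query, keys, values):
    """Soft attention over the M most recent stored frames.

    query:  (D,)   encoding of the current observation
    keys:   (M, D) key vectors for the last M frames
    values: (M, D) value vectors for the last M frames
    """
    scores = keys @ query                 # (M,) similarity of the query to each key
    weights = F.softmax(scores, dim=0)    # normalize into a read distribution
    return weights @ values               # (D,) retrieved memory vector

def memory_write(keys, values, new_key, new_value):
    """Append the newest percept and drop the oldest frame."""
    keys = torch.cat([keys[1:], new_key.unsqueeze(0)])
    values = torch.cat([values[1:], new_value.unsqueeze(0)])
    return keys, values
```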

Memory Network Difficulties. Easy to learn: there is never a need to guess what to store in memory, just store as much as possible! But it can be inefficient: we need M to exceed the time horizon of the task (which we can't know a priori), we might store a lot of useless or redundant data, and time/space requirements increase with M.

Neural Map (Location-Aware Memory). A writeable memory with a specific inductive bias: we structure the memory into a W x W grid of K-dimensional cells, and for every position (x, y) in the environment, we write to the corresponding cell (x', y') in the W x W grid.

[Figure: successive write operations into the W x W x K neural map M_t as the agent moves through the environment.]

Neural Map (Location-Aware Memory). A writeable memory with a specific inductive bias: we structure the memory into a W x W grid of K-dimensional cells, and for every position (x, y) in the environment, we write to the corresponding cell (x', y') in the W x W grid. The memory acts as a map that the agent fills out as it explores. Sparse write: the inductive bias prevents the agent from overwriting its memory too often, allowing easier credit assignment over time.
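
A minimal sketch of the location-aware write, assuming the agent's (x, y) position is supplied by an oracle and the environment spans [0, env_size) in each dimension; the function names and discretization are illustrative, not the paper's exact scheme:

```python
import torch

def env_to_cell(x, y, env_size, W):
    """Map a continuous environment position (x, y) to a grid cell (x', y')."""
    xp = min(int(x / env_size * W), W - 1)
    yp = min(int(y / env_size * W), W - 1)
    return xp, yp

def sparse_write(neural_map, x, y, env_size, write_vector):
    """Overwrite only the single cell under the agent; all other cells stay untouched.

    neural_map:   (K, W, W) memory M_t
    write_vector: (K,)      new feature vector for the agent's current cell
    """
    W = neural_map.shape[-1]
    xp, yp = env_to_cell(x, y, env_size, W)
    updated = neural_map.clone()
    updated[:, yp, xp] = write_vector  # write the new K-dim feature at (x', y')
    return updated
```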

Neural Map: Operations. Two read operations: global summarization and context-based retrieval. A sparse write, only at the agent's position. Both the read and write vectors are used to compute the policy.

Neural Map: Global Read. The global read vector r_t is produced by a deep convolutional network applied to the entire W x W x K neural map M_t, giving a global summary of its contents.
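
A sketch of what such a global read could look like in PyTorch; the layer sizes are placeholders rather than the architecture used in the paper:

```python
import torch
import torch.nn as nn

class GlobalRead(nn.Module):
    """Summarize the whole W x W x K neural map into a single read vector r_t."""
    def __init__(self, K=32, W=15, out_dim=256):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(K, 32, kernel_size=3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, kernel_size=3, stride=2, padding=1), nn.ReLU(),
        )
        with torch.no_grad():
            flat = self.conv(torch.zeros(1, K, W, W)).numel()  # infer flattened size
        self.fc = nn.Linear(flat, out_dim)

    def forward(self, neural_map):               # neural_map: (1, K, W, W)
        h = self.conv(neural_map)
        return self.fc(h.flatten(start_dim=1))   # r_t: (1, out_dim)
```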

Neural Map: Context Read Associative read operation.

Neural Map: Context Read. Consider a simple 2x2 memory M_t, where each color represents a memory the agent wrote.

Neural Map: Context Read. A query vector q_t is computed from the current state s_t and the global read r_t.

Neural Map: Context Read. The dot product between the query q_t and every memory cell of M_t produces a similarity weight α_t for each cell.

Neural Map: Context Read. The similarities α_t are multiplied element-wise with the memory cells of M_t.

Neural Map: Context Read. Summing over all 4 positions gives the context read vector c_t.

Neural Map: Context Read. Intuitively, the context read returns the vector c_t in memory M_t that is closest to the query q_t.
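
Putting the walkthrough together, a sketch of the context read; the tensor shapes and the softmax normalization of the similarities are assumptions consistent with the description above, not the paper's exact equations:

```python
import torch
import torch.nn.functional as F

def context_read(state_feat, global_read, neural_map, query_proj):
    """Associative read: return the memory content closest to a learned query.

    state_feat:  (D_s,)     features of the current observation s_t
    global_read: (D_r,)     global read vector r_t
    neural_map:  (K, W, W)  memory M_t
    query_proj:  nn.Linear mapping (D_s + D_r) -> K
    """
    q = query_proj(torch.cat([state_feat, global_read]))  # query q_t: (K,)
    K_dim, W, _ = neural_map.shape
    cells = neural_map.reshape(K_dim, -1)                  # (K, W*W)
    scores = cells.t() @ q                                 # dot product per cell: (W*W,)
    alpha = F.softmax(scores, dim=0)                       # similarity weights α_t
    return cells @ alpha                                   # context read c_t: (K,)

# Example usage with hypothetical sizes:
# query_proj = torch.nn.Linear(128 + 256, 32)
# c_t = context_read(torch.randn(128), torch.randn(256), torch.randn(32, 15, 15), query_proj)
```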

Neural Map: Write. Creates a new K-dim vector to write at the agent's current position in the neural map, then updates the neural map at that position with this new vector.

Neural Map: Update. The new write vector w_{t+1} replaces the content of the current cell, producing the updated map M_{t+1} from M_t.

Neural Map: GRU Write Update. Alternatively, the current cell can be updated with a GRU-style gated update (Chung et al., 2014), combining the old cell content from M_t with the candidate write vector w_{t+1} to produce M_{t+1}.
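
A sketch of the GRU-style write, treating the old cell content as the hidden state and the candidate write vector as the input; the dimensions and exact gating are illustrative rather than the paper's formulation:

```python
import torch
import torch.nn as nn

K = 32
gru = nn.GRUCell(input_size=K, hidden_size=K)

def gru_write(neural_map, xp, yp, candidate):
    """Gated update of the cell at the agent's position (x', y').

    neural_map: (K, W, W) memory M_t
    candidate:  (K,)      proposed write vector w_{t+1}
    """
    old = neural_map[:, yp, xp]                                   # previous cell content
    new = gru(candidate.unsqueeze(0), old.unsqueeze(0)).squeeze(0)
    updated = neural_map.clone()
    updated[:, yp, xp] = new                                      # M_{t+1} differs only at (x', y')
    return updated
```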

Neural Map: Output Output the read vectors and what we wrote. Use those features to calculate a policy.
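
A sketch of how the output features might feed a policy, assuming an actor-critic head over the concatenated global read r_t, context read c_t, and written vector w_{t+1}; the layer sizes and head structure are my own, not the paper's:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class NeuralMapPolicy(nn.Module):
    def __init__(self, read_dim=256, K=32, num_actions=4):
        super().__init__()
        # Output features o_t = [r_t, c_t, w_{t+1}]
        self.fc = nn.Linear(read_dim + K + K, 256)
        self.policy_head = nn.Linear(256, num_actions)
        self.value_head = nn.Linear(256, 1)

    def forward(self, r_t, c_t, w_next):
        h = F.relu(self.fc(torch.cat([r_t, c_t, w_next], dim=-1)))
        return F.softmax(self.policy_head(h), dim=-1), self.value_head(h)
```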

Results: Random Maze with Indicator. [Figure: the real 3xKxK map (not visible to the agent) and the partially observable 3x15x3 input state.]

Random Maze Results

2D Maze Visualization

Task: Doom Maze ViZDoom (Kempka et al., 2016)

Doom Maze Results

Egocentric Neural Map. Problem with the Neural Map: it requires mapping from environment position (x, y) to map cell (x', y'), which means localization must already be solved. An alternative is an egocentric map: the agent always writes to the center of the map, and when the agent moves, the entire map shifts by the opposite amount.
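
A sketch of one way the egocentric update could be implemented: shift the map by the opposite of the agent's displacement (in grid cells), zero-fill cells that scroll in from outside, and write at the center. The zero-padding choice is an assumption, not stated in the talk:

```python
import torch

def egocentric_step(neural_map, dx_cells, dy_cells, write_vector):
    """Shift the map opposite to the agent's motion, then write at the center.

    neural_map: (K, W, W); dx_cells, dy_cells: agent displacement in grid cells.
    """
    K, W, _ = neural_map.shape
    shifted = torch.zeros_like(neural_map)
    # Content at source index s moves to destination index s - d (map moves by -d).
    src_x = slice(max(0, dx_cells), min(W, W + dx_cells))
    dst_x = slice(max(0, -dx_cells), min(W, W - dx_cells))
    src_y = slice(max(0, dy_cells), min(W, W + dy_cells))
    dst_y = slice(max(0, -dy_cells), min(W, W - dy_cells))
    shifted[:, dst_y, dst_x] = neural_map[:, src_y, src_x]
    shifted[:, W // 2, W // 2] = write_vector   # the agent is always at the center
    return shifted
```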

Conclusion Designed a novel memory architecture suited for DRL agents. Spatially structured memory useful for navigation tasks. Sparse write which simplifies credit assignment. Demonstrated its ability to store information over long time lags. Surpassed performance of several previous memory-based agents.

Future Directions Can we extend to multi-agent domains? Multiple agents communicating through shared memory. Can we train an agent to learn how to simultaneously localize and map its environment using the Neural Map? Solves problem of needing an oracle to supply (x,y) position. Can we structure neural maps into a multi-scale hierarchy? Each scale will incorporate longer range information.

Thank you Contact Information: Emilio Parisotto (eparisot@andrew.cmu.edu)

Extra Slides

What does the Neural Map learn to store? [Figure panels: Observations, True State, Context Read Distribution.]

Neural Map: Summary