Human-level control through deep reinforcement learning. Liia Butler


1 Human-level control through deep reinforcement learning
Liia Butler

2 But first... A quote "The question of whether machines can think... is about as relevant as the question of whether submarines can swim" Edsger W. Dijkstra

3 Overview
1. Introduction
2. Reinforcement Learning
3. Deep neural networks
4. Markov Decision Process
5. Algorithm Breakdown
6. Evaluation and conclusions

4 Introduction
Deep Q-network (DQN): the agent
Reinforcement learning plus deep neural networks
Goal: general artificial intelligence
  How little do we have to know to be intelligent?
  Can we solve a wide range of challenging tasks?
Pixels and game score as input

5 Reinforcement Learning
Theory of how software agents may optimize their control of the environment
Inspired by psychological and neuroscientific perspectives on animal behavior
One of the three types of machine learning

6 Space Invaders


7 Deep Neural Networks
An architecture in deep learning; a type of artificial neural network
Artificial neural network: a network of highly connected nodes (processing elements) that work together on specific problems, as in a biological nervous system
Multiple layers of nodes with increasing abstraction of the data
Extract high-level representations from raw data
DQN uses a "deep convolutional network":
  84 x 84 x 4 image produced by the preprocessing map
  Three convolutional layers
  Two fully connected layers
Figure: /n7540/images/nature14236f4.jpg
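To make the layer stack concrete, the following sketch computes the feature-map sizes after each convolutional layer. It assumes an 84 x 84 x 4 input, no padding, and the filter sizes and strides reported in the DQN paper (8x8 stride 4 with 32 filters, 4x4 stride 2 with 64 filters, 3x3 stride 1 with 64 filters). None of these numbers appear on the slide, and the function names are illustrative.

```python
def conv_output_size(input_size, kernel_size, stride):
    """Spatial size after an unpadded convolution."""
    return (input_size - kernel_size) // stride + 1


# (kernel size, stride, number of filters); assumed from the DQN paper,
# not stated on the slide.
LAYERS = [(8, 4, 32), (4, 2, 64), (3, 1, 64)]


def feature_map_shapes(input_size=84, layers=LAYERS):
    """Return the (height, width, channels) shape after each conv layer."""
    shapes = []
    size = input_size
    for kernel, stride, channels in layers:
        size = conv_output_size(size, kernel, stride)
        shapes.append((size, size, channels))
    return shapes


def flattened_size(shape):
    """Number of values fed into the first fully connected layer."""
    height, width, channels = shape
    return height * width * channels
```

Under these assumptions the spatial size shrinks from 84 to 20, then 9, then 7, so the first fully connected layer receives 7 * 7 * 64 = 3136 values.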

8 Markov Decision Process State Action Reward

9 What these mean for DQN
State: What is going on?
  The goal was to be universal, so it's represented by screen pixels
Action: What can we do?
  Ex. moving, direction, buttons
Reward: What's our motivation?
  Points, lives, etc.

10 How is DQN going to do this?
Preprocessing: reduce input dimensionality, max value for pixel color, remove flickering
ε-greedy policy: choosing the action
Bellman equation: optimal control of the environment, action-value function
Using a function approximator to estimate the action-value function: loss function and Q-learning gradient
Experience replay: building a data set from the agent's experience
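A minimal sketch of two of the preprocessing ideas named above, assuming frames arrive as 2D lists of grayscale pixel values. Taking the per-pixel maximum over two consecutive frames is one way to remove flicker, and the stride-based downsampling stands in for whatever rescaling is actually used. The function names and the `factor` parameter are illustrative assumptions, not the paper's exact procedure.

```python
def max_frames(prev_frame, cur_frame):
    """Per-pixel maximum of two consecutive frames (suppresses flicker)."""
    return [[max(p, c) for p, c in zip(prev_row, cur_row)]
            for prev_row, cur_row in zip(prev_frame, cur_frame)]


def downsample(frame, factor):
    """Keep every `factor`-th pixel in each dimension (crude size reduction)."""
    return [row[::factor] for row in frame[::factor]]


def preprocess(prev_frame, cur_frame, factor=2):
    """Combine both steps: de-flicker first, then shrink the image."""
    return downsample(max_frames(prev_frame, cur_frame), factor)
```

A sprite that is drawn only on alternating frames survives the maximum, so the agent sees it in every preprocessed frame.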

11 Algorithm Breakdown
Key
D = memory, or data set
N = number of experience tuples in replay memory
Q = "quality" (action-value) function
Θ = the weights of the function approximator
M = number of episodes
s = sequence
x = observation/image
Φ = preprocessing function applied to the sequence
T = time step at which the game terminates
ε = probability of a random action in the ε-greedy policy
a = action
s = state
y = target
r = reward
γ = reward discount factor
C = number of updates to Q between resets of the target network

12 Algorithm Breakdown: ε-greedy policy

13 ε-greedy policy
How to choose the action 'a' at time 't'
Exploration: with probability ε, pick a random action
Exploitation: otherwise, pick the best action according to the Q value
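The choice between exploring and exploiting can be sketched as follows, assuming the Q values for the current state are available as a plain list indexed by action. The function name and the `rng` parameter are illustrative.

```python
import random


def epsilon_greedy(q_values, epsilon, rng=random):
    """Pick an action index from a list of Q values.

    With probability `epsilon` a uniformly random action is returned
    (exploration); otherwise the action with the largest Q value is
    returned (exploitation). Ties go to the lowest index.
    """
    if rng.random() < epsilon:
        return rng.randrange(len(q_values))
    return max(range(len(q_values)), key=lambda action: q_values[action])
```

With `epsilon=0.0` the policy is purely greedy, and with `epsilon=1.0` it is purely random. Starting near 1 and lowering ε over time is a common schedule.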

14 Algorithm Breakdown: Experience Replay

15 Experience Replay
Take action
Store transition in memory
Sample random minibatch of transitions from D
Optimize using gradient descent on target 'y' and Q-network
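The store-and-sample steps can be sketched with a fixed-capacity buffer. This is a minimal illustration: the class name, the tuple layout, and the `done` flag are assumptions rather than details taken from the slides.

```python
import random
from collections import deque


class ReplayMemory:
    """Fixed-capacity memory D holding the N most recent transitions."""

    def __init__(self, capacity):
        # A deque with maxlen silently drops the oldest entry when full.
        self.buffer = deque(maxlen=capacity)

    def store(self, state, action, reward, next_state, done):
        """Append one transition tuple to the memory."""
        self.buffer.append((state, action, reward, next_state, done))

    def sample(self, batch_size, rng=random):
        """Draw a uniformly random minibatch without replacement."""
        return rng.sample(list(self.buffer), batch_size)

    def __len__(self):
        return len(self.buffer)
```

Sampling at random rather than replaying transitions in order breaks up the strong correlation between consecutive frames.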

16 Optimizing the Q-Network
Bellman equation:
The loss function we have:
From this:
Gives us the Q-learning gradient:
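The equations on this slide did not survive extraction. The block below gives the standard DQN formulation as a hedged reconstruction, not a copy of the slide. It writes θ for the weights (Θ in the key), γ for the reward discount factor, U(D) for uniform sampling from the replay memory, and θ⁻ for an older copy of the weights used to compute the target; θ⁻ is notation assumed here and is not defined in the key. The target y is treated as fixed when differentiating.

```latex
% Bellman optimality equation for the action-value function
Q^{*}(s,a) = \mathbb{E}_{s'}\!\left[\, r + \gamma \max_{a'} Q^{*}(s',a') \;\middle|\; s,a \right]

% Loss at iteration i, with target y
y = r + \gamma \max_{a'} Q(s',a';\theta_i^{-}), \qquad
L_i(\theta_i) = \mathbb{E}_{(s,a,r,s') \sim U(D)}\!\left[ \bigl( y - Q(s,a;\theta_i) \bigr)^{2} \right]

% Q-learning gradient (y held fixed)
\nabla_{\theta_i} L_i(\theta_i) =
  -2\, \mathbb{E}_{(s,a,r,s') \sim U(D)}\!\left[ \bigl( y - Q(s,a;\theta_i) \bigr)\, \nabla_{\theta_i} Q(s,a;\theta_i) \right]
```

The constant factor is often absorbed into the learning rate, so a gradient descent step moves θ in the direction of (y − Q) ∇Q.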

17 Algorithm Breakdown
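The way the target y, the reward r, and the discount factor from the key fit together can be sketched with a lookup table standing in for the deep network. That substitution is a deliberate simplification, and the function names and `learning_rate` parameter are illustrative assumptions.

```python
def td_target(reward, next_q_values, gamma, terminal):
    """Target y for one transition.

    If the game ended on this step there is no future reward, so y is
    just the reward; otherwise the discounted best next Q value is added.
    """
    if terminal:
        return reward
    return reward + gamma * max(next_q_values)


def q_update(q_table, state, action, target, learning_rate):
    """One gradient-style step on (y - Q)^2 for a tabular Q.

    `q_table` maps each state to a list of Q values, one per action.
    Returns the error (y - Q) measured before the update.
    """
    error = target - q_table[state][action]
    q_table[state][action] += learning_rate * error
    return error
```

In DQN the table is replaced by the convolutional network, and the same error term drives gradient descent on the network weights.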

18 Breakout!

19 Evaluation and Conclusions
Agents vs. pro gamers
Agent acts at 10 Hz (an action every 0.1 seconds), i.e. every 6th frame
At 60 Hz (an action every 1/60 of a second), i.e. every frame, only 6 games showed more than 5% better performance
Controlled human conditions
Out of the 49 games: 29 at human level or above, 20 below

20 Figure (nature14236f3.jpg): 29 out of 49 games at human level or above, 20 out of 49 below

21 Questions and Discussion
What do you think are some non-gaming applications of deep reinforcement learning?
Do you think that comparing with the "professional human game tester" is a sufficient evaluation? Is there a better way?
Should we even have a general AI, or are we better off with domain-specific AIs?
Are there other consequences besides a computer beating your high score? (Have we doomed society?)
