REINFORCEMENT LEARNING

Clementine Fowler
5 years ago

1 REINFORCEMENT LEARNING

2 Larry Page: Where's Google going next?

3 DeepMind's DQN playing Breakout

4 Contents
Introduction to Reinforcement Learning
Deep Q-Learning

5 INTRODUCTION TO REINFORCEMENT LEARNING

6 Contents
Reinforcement Learning
Markov Decision Process
Value function
Bellman equation
Action value function (Q-function)

7 Supervised Learning
Training samples: $(x_i, y_i)$
Training goal: to find a function $f$ such that $y_i \approx f(x_i)$
Atari game example: $(x_i, y_i)$ = (game state, joystick control)


8 Reinforcement Learning
Reinforcement learning is an area of machine learning concerned with how software agents ought to take actions in an environment so as to maximize some notion of cumulative reward.
(Figure: Atari example)

9 Reinforcement Learning
Learning from interaction
Goal-oriented learning
Learning about, from, and while interacting with an external environment
Learning what to do (how to map situations to actions) so as to maximize a numerical reward signal

10 Key Features of RL
Learner is not told which actions to take
Trial-and-error search
Possibility of delayed reward (sacrifice short-term gains for greater long-term gains)
The need to explore and exploit
Considers the whole problem of a goal-directed agent interacting with an uncertain environment

12 THEORY

13 Reinforcement Learning Setting
$S$: set of states
$A$: set of actions
$R : S \times A \to \mathbb{R}$: reward for a given state and action

14 Reinforcement Learning Setting
The agent-environment interaction in reinforcement learning
$P\{s_{t+1} = s', r_{t+1} = r \mid s_t, a_t, s_{t-1}, a_{t-1}, \dots, s_0, a_0\}$

15 Reward
Discount rate: $\gamma \in [0, 1)$
It is the discount factor, which represents the difference in importance between future rewards and present rewards.
$R_t = r_{t+1} + \gamma r_{t+2} + \gamma^2 r_{t+3} + \dots = \sum_{k=0}^{\infty} \gamma^k r_{t+k+1}$
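The discounted return above can be computed for a finite reward sequence with a short backward pass. This is a minimal sketch; the function name `discounted_return` and the sample rewards are illustrative assumptions, not part of the slides.

```python
def discounted_return(rewards, gamma):
    """Return sum_k gamma**k * rewards[k], where rewards[k] stands for r_{t+k+1}."""
    total = 0.0
    # Work backwards using the recursion R_t = r_{t+1} + gamma * R_{t+1}.
    for r in reversed(rewards):
        total = r + gamma * total
    return total


# Hypothetical rewards r_{t+1}, r_{t+2}, r_{t+3}.
rewards = [1.0, 0.0, 2.0]
print(discounted_return(rewards, gamma=0.5))  # 1 + 0.5*0 + 0.25*2 = 1.5
```

With $\gamma$ close to 0 the agent values almost only the immediate reward; with $\gamma$ close to 1 later rewards count nearly as much as the first.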

16 Markov Decision Process
A Markov Decision Process is a reinforcement learning task that satisfies the Markov property:
$P\{s_{t+1} = s', r_{t+1} = r \mid s_t, a_t, s_{t-1}, a_{t-1}, \dots, s_0, a_0\} = P\{s_{t+1} = s', r_{t+1} = r \mid s_t, a_t\}$
Transition probabilities: $P^a_{ss'} = P\{s_{t+1} = s' \mid s_t = s, a_t = a\}$
Expected rewards: $R^a_{ss'} = E\{r_{t+1} \mid s_t = s, a_t = a, s_{t+1} = s'\}$

17 Policy
A policy $\pi$ is a mapping from each state $s \in S$ and action $a \in A(s)$ to the probability $\pi(s, a)$ of taking action $a$ when in state $s$.

18 Value Functions
State-value function for policy $\pi$:
$V^\pi(s) = E\{R_t \mid s_t = s\} = E\{\sum_{k=0}^{\infty} \gamma^k r_{t+k+1} \mid s_t = s\}$
Action-value function for policy $\pi$:
$Q^\pi(s, a) = E\{R_t \mid s_t = s, a_t = a\} = E\{\sum_{k=0}^{\infty} \gamma^k r_{t+k+1} \mid s_t = s, a_t = a\}$

19 Optimal Value Functions
$V^*(s) = \max_\pi V^\pi(s)$
$Q^*(s, a) = \max_\pi Q^\pi(s, a) = E\{r_{t+1} + \gamma V^*(s_{t+1}) \mid s_t = s, a_t = a\}$

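20 Bellman Equation
Bellman equation for $V^\pi$:
$V^\pi(s) = \sum_a \pi(s, a) \sum_{s'} P^a_{ss'} [R^a_{ss'} + \gamma V^\pi(s')]$
Bellman equation for $V^*$:
$V^*(s) = \max_a \sum_{s'} P^a_{ss'} [R^a_{ss'} + \gamma V^*(s')]$
Bellman equation for $Q^*$:
$Q^*(s, a) = \sum_{s'} P^a_{ss'} [R^a_{ss'} + \gamma \max_{a'} Q^*(s', a')]$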

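21 Bellman Equation (Fixed policy)
Bellman equation for $V^\pi$ with a deterministic policy, $a = \pi(s)$:
$V^\pi(s) = \sum_{s'} P^a_{ss'} R^a_{ss'} + \gamma \sum_{s'} P^a_{ss'} V^\pi(s') = R(s, a) + \gamma \sum_{s'} P^a_{ss'} V^\pi(s')$
where $R(s, a) = \sum_{s'} P^a_{ss'} R^a_{ss'}$ is the expected immediate reward.
Bellman equation for $V^*$:
$V^*(s) = \max_a \sum_{s'} P^a_{ss'} [R^a_{ss'} + \gamma V^*(s')] = \max_a [R(s, a) + \gamma \sum_{s'} P^a_{ss'} V^*(s')]$
Bellman equation for $Q^*$:
$Q^*(s, a) = \sum_{s'} P^a_{ss'} [R^a_{ss'} + \gamma \max_{a'} Q^*(s', a')]$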
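One common way to use the Bellman equation for $V^*$ is to apply its right-hand side repeatedly as an update until the values stop changing (value iteration). The slides do not name this procedure here, so treat the following as an illustrative sketch. The array layout `P[a, s, s']` and `R[a, s, s']`, the function name, and the two-state toy MDP are all assumptions made for the example.

```python
import numpy as np


def value_iteration(P, R, gamma, tol=1e-10, max_iters=10_000):
    """Iterate V(s) <- max_a sum_{s'} P[a,s,s'] * (R[a,s,s'] + gamma * V(s'))."""
    n_states = P.shape[1]
    V = np.zeros(n_states)
    for _ in range(max_iters):
        # Q[a, s] is the bracketed Bellman backup for every action and state.
        Q = np.sum(P * (R + gamma * V[None, None, :]), axis=2)
        V_new = Q.max(axis=0)
        if np.max(np.abs(V_new - V)) < tol:
            return V_new, Q
        V = V_new
    return V, Q


# Hypothetical toy MDP: action 0 stays put, action 1 switches state.
# Landing in state 1 yields reward 1, landing in state 0 yields 0.
P = np.zeros((2, 2, 2))
P[0, 0, 0] = P[0, 1, 1] = 1.0
P[1, 0, 1] = P[1, 1, 0] = 1.0
R = np.zeros((2, 2, 2))
R[:, :, 1] = 1.0

V_star, Q_star = value_iteration(P, R, gamma=0.9)
print(V_star)  # both entries are close to 10 = 1 / (1 - 0.9)
```

In this toy problem the best behaviour is to reach state 1 and stay there, which earns reward 1 at every step, hence the value $1 / (1 - \gamma)$ in both states.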

22 LEARNING METHOD

23 A COMPLETE MODEL OF THE ENVIRONMENT'S DYNAMICS IS GIVEN

24 Policy Evaluation
Policy evaluation: for a given policy $\pi$, compute the state-value function $V^\pi$.
Recall: $V^\pi(s) = \sum_a \pi(s, a) \sum_{s'} P^a_{ss'} [R^a_{ss'} + \gamma V^\pi(s')]$
This is a system of $|S|$ simultaneous linear equations.
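Because the equations are linear in $V^\pi$, a small problem can be solved directly. Writing $P_\pi$ for the state-to-state transition matrix under $\pi$ and $r_\pi$ for the expected one-step reward, the system reads $(I - \gamma P_\pi) V^\pi = r_\pi$. The sketch below assumes the model is stored as arrays `P[a, s, s']` and `R[a, s, s']` and the policy as `pi[s, a]`; these names and the toy MDP are illustrative, not taken from the slides.

```python
import numpy as np


def evaluate_policy(P, R, pi, gamma):
    """Solve (I - gamma * P_pi) V = r_pi for the state values of policy pi."""
    n_states = P.shape[1]
    # P_pi[s, s'] = sum_a pi[s, a] * P[a, s, s']
    P_pi = np.einsum("sa,ast->st", pi, P)
    # r_pi[s] = sum_a pi[s, a] * sum_{s'} P[a, s, s'] * R[a, s, s']
    r_pi = np.einsum("sa,ast,ast->s", pi, P, R)
    return np.linalg.solve(np.eye(n_states) - gamma * P_pi, r_pi)


# Hypothetical toy MDP: action 0 stays put, action 1 switches state.
# Landing in state 1 yields reward 1, landing in state 0 yields 0.
P = np.zeros((2, 2, 2))
P[0, 0, 0] = P[0, 1, 1] = 1.0
P[1, 0, 1] = P[1, 1, 0] = 1.0
R = np.zeros((2, 2, 2))
R[:, :, 1] = 1.0

always_stay = np.array([[1.0, 0.0], [1.0, 0.0]])
uniform = np.full((2, 2), 0.5)
print(evaluate_policy(P, R, always_stay, gamma=0.9))  # close to [0, 10]
print(evaluate_policy(P, R, uniform, gamma=0.9))      # close to [5, 5]
```

A direct solve costs roughly cubic time in $|S|$, so for large state spaces an iterative sweep over the same equation is the usual alternative.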

25 Policy Evaluation

26 Policy Improvement
Suppose we have computed $V^\pi$ for a deterministic policy $\pi$.
For a given state $s$, would it be better to do an action $a \neq \pi(s)$?
The value of doing $a$ in state $s$ is:
$Q^\pi(s, a) = \sum_{s'} P^a_{ss'} [R^a_{ss'} + \gamma V^\pi(s')]$
It is better to take action $a$ in state $s$ if and only if $Q^\pi(s, a) > V^\pi(s)$.
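The improvement test can be applied to every state at once by computing $Q^\pi(s, a)$ from $V^\pi$ and picking the best action per state. The following is a sketch under assumed conventions: the model arrays `P[a, s, s']` and `R[a, s, s']`, the function name `improve_policy`, and the toy numbers are all hypothetical.

```python
import numpy as np


def improve_policy(P, R, V, gamma):
    """Return the greedy action per state and the table Q[s, a] built from V."""
    # Q[s, a] = sum_{s'} P[a, s, s'] * (R[a, s, s'] + gamma * V[s'])
    Q = np.sum(P * (R + gamma * V[None, None, :]), axis=2).T
    return Q.argmax(axis=1), Q


# Hypothetical toy MDP: action 0 stays put, action 1 switches state.
# Landing in state 1 yields reward 1, landing in state 0 yields 0.
P = np.zeros((2, 2, 2))
P[0, 0, 0] = P[0, 1, 1] = 1.0
P[1, 0, 1] = P[1, 1, 0] = 1.0
R = np.zeros((2, 2, 2))
R[:, :, 1] = 1.0

# Values of the "always stay" policy with gamma = 0.9 (state 0 never earns reward).
V_stay = np.array([0.0, 10.0])
greedy, Q = improve_policy(P, R, V_stay, gamma=0.9)
print(greedy)   # [1 0]: switch in state 0, stay in state 1
print(Q[0, 1])  # 10.0, which exceeds V_stay[0] = 0.0
```

Here $Q^\pi(0, 1) = 1 + 0.9 \cdot 10 = 10 > V^\pi(0) = 0$, so switching in state 0 is an improvement, exactly as the criterion states.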

27 Policy Iteration
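Policy iteration alternates the two steps: evaluate the current policy, then make it greedy with respect to the resulting values, and stop when the policy no longer changes. The sketch below assumes a known model stored as `P[a, s, s']` and `R[a, s, s']` and a deterministic policy stored as one action index per state; the names and the toy MDP are illustrative assumptions.

```python
import numpy as np


def policy_iteration(P, R, gamma, max_iters=100):
    """Alternate exact policy evaluation and greedy improvement until stable."""
    n_states = P.shape[1]
    states = np.arange(n_states)
    policy = np.zeros(n_states, dtype=int)  # start with action 0 everywhere
    for _ in range(max_iters):
        # Evaluation: solve (I - gamma * P_pi) V = r_pi for the current policy.
        P_pi = P[policy, states, :]
        r_pi = np.sum(P_pi * R[policy, states, :], axis=1)
        V = np.linalg.solve(np.eye(n_states) - gamma * P_pi, r_pi)
        # Improvement: act greedily with respect to Q[a, s] computed from V.
        Q = np.sum(P * (R + gamma * V[None, None, :]), axis=2)
        new_policy = Q.argmax(axis=0)
        if np.array_equal(new_policy, policy):
            break
        policy = new_policy
    return policy, V


# Hypothetical toy MDP: action 0 stays put, action 1 switches state.
# Landing in state 1 yields reward 1, landing in state 0 yields 0.
P = np.zeros((2, 2, 2))
P[0, 0, 0] = P[0, 1, 1] = 1.0
P[1, 0, 1] = P[1, 1, 0] = 1.0
R = np.zeros((2, 2, 2))
R[:, :, 1] = 1.0

policy, V = policy_iteration(P, R, gamma=0.9)
print(policy)  # [1 0]
print(V)       # close to [10, 10]
```

On this example the loop stops after the second evaluation, since the greedy policy no longer changes.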

28 Other methods..

29 DEEP Q-LEARNING

30 Introduction
We want to achieve human-level control by using deep reinforcement learning.
Reinforcement learning is applied to classic games on the Atari 2600 platform as a representative benchmark.

31 Q-Networks
Represent the action-value function by a Q-network with weights $w$:
$Q(s, a; w) \approx Q^*(s, a)$

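32 Q-Learning
Optimal Q-values should obey the Bellman equation:
$Q^*(s, a) = \sum_{s'} P^a_{ss'} [R^a_{ss'} + \gamma \max_{a'} Q^*(s', a')]$
Applied to the Q-network, with the expectation over $s'$ replaced by a single observed transition $(s, a, r, s')$:
$Q(s, a; w) = \sum_{s'} P^a_{ss'} [R^a_{ss'} + \gamma \max_{a'} Q(s', a'; w)] \approx r + \gamma \max_{a'} Q(s', a'; w)$
Treat the right-hand side $r + \gamma \max_{a'} Q(s', a'; w)$ as a target.
Minimize the MSE loss by stochastic gradient descent.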
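A single stochastic gradient step on the squared error between the target and $Q(s, a; w)$ can be sketched as follows. To keep the example small, a linear function `w @ s` stands in for the deep network, and the target is held constant while differentiating; both are simplifying assumptions for illustration, as are the function names and the sample transition.

```python
import numpy as np


def q_values(w, s):
    """Linear stand-in for the Q-network: one row of weights per action."""
    return w @ s


def sgd_step(w, s, a, r, s_next, gamma, lr):
    """One gradient step on (target - Q(s, a; w))**2 with the target held fixed."""
    target = r + gamma * np.max(q_values(w, s_next))
    error = target - q_values(w, s)[a]
    loss = error ** 2
    w_new = w.copy()
    # d(loss)/d(w[a]) = -2 * error * s, so descending the gradient adds 2*lr*error*s.
    w_new[a] += 2.0 * lr * error * s
    return w_new, loss


# Hypothetical transition (s, a, r, s') with 2 actions and 2 state features.
w = np.zeros((2, 2))
s = np.array([1.0, 0.0])
s_next = np.array([0.0, 1.0])

w1, loss1 = sgd_step(w, s, a=0, r=1.0, s_next=s_next, gamma=0.9, lr=0.25)
w2, loss2 = sgd_step(w1, s, a=0, r=1.0, s_next=s_next, gamma=0.9, lr=0.25)
print(loss1, loss2)  # 1.0 0.25
```

Repeating the step on the same transition shrinks the loss, because $Q(s, a; w)$ moves toward the target $r + \gamma \max_{a'} Q(s', a'; w)$.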

33 Deep Q-Networks

34 Deep Reinforcement Learning in Atari

35 DQN in Atari

36 Algorithm
$\phi$ denotes the preprocessing step that stacks the 4 most recent images.
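The frame stacking performed by $\phi$ can be sketched with a fixed-length buffer. The class name `FrameStacker`, the tiny 2x2 frames, and the choice to fill the buffer with copies of the first frame at the start of an episode are assumptions made for this illustration; the slides only state that the 4 most recent images are stacked.

```python
from collections import deque

import numpy as np


class FrameStacker:
    """Keep the n most recent frames and expose them as one stacked array."""

    def __init__(self, n_frames=4):
        self.n_frames = n_frames
        self.frames = deque(maxlen=n_frames)

    def reset(self, first_frame):
        # Assumption: at episode start, repeat the first frame to fill the buffer.
        self.frames.clear()
        for _ in range(self.n_frames):
            self.frames.append(first_frame)
        return self.state()

    def push(self, frame):
        # Appending to a full deque drops the oldest frame automatically.
        self.frames.append(frame)
        return self.state()

    def state(self):
        # Shape: (n_frames, height, width), oldest frame first.
        return np.stack(list(self.frames), axis=0)


stacker = FrameStacker(n_frames=4)
state = stacker.reset(np.zeros((2, 2)))
state = stacker.push(np.ones((2, 2)))
print(state.shape)  # (4, 2, 2)
```

Stacking several consecutive frames gives the network access to motion, such as the direction of the ball in Breakout, which a single image cannot convey.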

37 Architecture
