
1 Q-Learning
Christopher Watkins and Peter Dayan
Noga Zaslavsky
The Hebrew University of Jerusalem
Advanced Seminar in Deep Learning (67679)
November 1, 2015

2 Outline
1 What is reinforcement learning?
  Modeling the problem
  Bellman optimality
2 Q-Learning
  Algorithm
  Convergence analysis
3 Assumptions and Limitations
4 Summary


4-7 Reinforcement Learning: Agent-Environment Interaction (What is reinforcement learning?)
Reinforcement learning is learning how to behave in order to maximize value or achieve some goal.
It is not supervised learning, and it is not unsupervised learning.
There is no expert to tell the learning agent right from wrong; it is forced to learn from its own experience.
The learning agent receives feedback from its environment, so it can learn by trying different actions in different situations.
There is a tradeoff between exploration and exploitation.
[Figure: the agent-environment interaction loop between the Agent and the Environment.]

8 Toy Problem (What is reinforcement learning?)
[Figure: a toy example with several states s and a final state s_f.]

9-14 Markov Decision Process (MDP) (Modeling the problem)
$s \in S$ - the state of the system
$a \in A$ - the agent's action
$p(s' \mid s, a)$ - the dynamics of the system
$r : S \times A \to \mathbb{R}$ - the reward function (possibly stochastic)
Definition: an MDP is the tuple $\langle S, A, p, r \rangle$
$\pi(a \mid s)$ - the agent's policy
[Figure: the resulting process as a graphical model: states $S_{t-1}, S_t, S_{t+1}, \dots$ linked by the dynamics $p$, actions $A_t$ drawn from the policy $\pi$, and rewards $R_t$ received at each step.]
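
To make the formal definition concrete, here is a minimal Python sketch (an addition, not from the slides; the two-state MDP and all of its numbers are invented purely for illustration) that stores the tuple $\langle S, A, p, r \rangle$ and a policy $\pi$ as plain dictionaries:

# A toy MDP <S, A, p, r>; hypothetical numbers chosen only for illustration.
S = ["s0", "s1"]          # states
A = ["stay", "move"]      # actions

# Dynamics p(s' | s, a), stored as {(s, a): {s': probability}}
p = {
    ("s0", "stay"): {"s0": 1.0},
    ("s0", "move"): {"s0": 0.2, "s1": 0.8},
    ("s1", "stay"): {"s1": 1.0},
    ("s1", "move"): {"s0": 0.9, "s1": 0.1},
}

# Reward function r(s, a) (deterministic here, though it may be stochastic)
r = {
    ("s0", "stay"): 0.0, ("s0", "move"): -0.1,
    ("s1", "stay"): 1.0, ("s1", "move"): -0.1,
}

# A stochastic policy pi(a | s): probability of each action in each state
pi = {
    "s0": {"stay": 0.5, "move": 0.5},
    "s1": {"stay": 0.9, "move": 0.1},
}

The same dictionary conventions are reused in the sketches further below.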

15-17 The Value Function (Modeling the problem)
The un-discounted value function for any given policy $\pi$:
$$V^\pi(s) = \mathbb{E}\left[\sum_{t=0}^{T} r(s_t, a_t) \,\middle|\, s_0 = s\right]$$
$T$ is the horizon - it can be finite, episodic, or infinite.
The discounted value function for any given policy $\pi$:
$$V^\pi(s) = \mathbb{E}\left[\sum_{t=0}^{\infty} \gamma^t r(s_t, a_t) \,\middle|\, s_0 = s\right]$$
$0 < \gamma < 1$ is the discounting factor.
The goal: find a policy $\pi^*$ that maximizes the value, i.e. for every $s \in S$:
$$V^*(s) = V^{\pi^*}(s) = \max_\pi V^\pi(s)$$
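
As an illustration of the discounted value (again an addition, not part of the deck), $V^\pi(s)$ can be approximated by averaging truncated discounted returns over sampled trajectories. The sketch below assumes the dictionary-based pi, p and r of the toy MDP above:

import random

def sample_from(dist):
    """Draw a key from a {outcome: probability} dictionary."""
    u, acc = random.random(), 0.0
    for outcome, prob in dist.items():
        acc += prob
        if u <= acc:
            return outcome
    return outcome  # guard against floating-point rounding

def monte_carlo_value(pi, p, r, s0, gamma=0.9, horizon=200, n_rollouts=1000):
    """Estimate V^pi(s0) by averaging truncated discounted returns."""
    total = 0.0
    for _ in range(n_rollouts):
        s, ret, discount = s0, 0.0, 1.0
        for _ in range(horizon):         # truncate the infinite discounted sum
            a = sample_from(pi[s])       # a ~ pi(. | s)
            ret += discount * r[(s, a)]  # accumulate gamma^t r(s_t, a_t)
            discount *= gamma
            s = sample_from(p[(s, a)])   # s' ~ p(. | s, a)
        total += ret
    return total / n_rollouts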

18-20 The Bellman Equation (Bellman optimality)
The value function can be written recursively:
$$V^\pi(s) = \mathbb{E}\left[\sum_{t=0}^{\infty} \gamma^t r(s_t, a_t) \,\middle|\, s_0 = s\right] = \mathbb{E}_{a \sim \pi(\cdot \mid s),\, s' \sim p(\cdot \mid s, a)}\left[r(s, a) + \gamma V^\pi(s')\right]$$
The optimal value satisfies the Bellman equation:
$$V^*(s) = \max_\pi \mathbb{E}_{a \sim \pi(\cdot \mid s),\, s' \sim p(\cdot \mid s, a)}\left[r(s, a) + \gamma V^*(s')\right]$$
$\pi^*$ is not necessarily unique, but $V^*$ is unique.
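
The step between the two expressions is simply splitting the first reward off the discounted sum; spelled out (a standard derivation added here for completeness):
$$
\begin{aligned}
V^\pi(s) &= \mathbb{E}\left[r(s_0, a_0) + \gamma \sum_{t=1}^{\infty} \gamma^{t-1} r(s_t, a_t) \,\middle|\, s_0 = s\right] \\
&= \mathbb{E}_{a \sim \pi(\cdot \mid s),\, s' \sim p(\cdot \mid s, a)}\left[r(s, a) + \gamma\, \mathbb{E}\left[\sum_{t=0}^{\infty} \gamma^{t} r(s_t, a_t) \,\middle|\, s_0 = s'\right]\right] \\
&= \mathbb{E}_{a \sim \pi(\cdot \mid s),\, s' \sim p(\cdot \mid s, a)}\left[r(s, a) + \gamma V^\pi(s')\right],
\end{aligned}
$$
where the second equality uses the Markov property: conditioned on the next state being $s'$, the tail of the trajectory is distributed exactly like a fresh trajectory started at $s'$.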

21-24 The Q-Function (Bellman optimality)
The value of taking action $a$ at state $s$:
$$Q^*(s, a) = r(s, a) + \gamma \, \mathbb{E}_{s' \sim p(\cdot \mid s, a)}\left[V^*(s')\right]$$
If we know $V^*$ then an optimal policy is to decide deterministically:
$$a^*(s) = \arg\max_a Q^*(s, a)$$
This means that
$$V^*(s) = \max_a Q^*(s, a)$$
Conclusion: learning $Q^*$ makes it easy to obtain an optimal policy.
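
The conclusion is easy to act on in code: given a table of Q-values, the greedy policy is one argmax per state. A minimal sketch (an illustration, assuming Q is stored as a dictionary keyed by (state, action) pairs):

def greedy_action(Q, s, actions):
    """a*(s) = argmax_a Q(s, a) for a table stored as Q[(s, a)]."""
    return max(actions, key=lambda a: Q[(s, a)])

def greedy_policy(Q, states, actions):
    """Extract the deterministic greedy policy for every state."""
    return {s: greedy_action(Q, s, actions) for s in states}

def state_value(Q, s, actions):
    """V(s) = max_a Q(s, a)."""
    return max(Q[(s, a)] for a in actions)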

25 Outline (section: Q-Learning)

26-27 Learning Q* (Q-Learning: Algorithm)
If the agent knows the dynamics $p$ and the reward function $r$, it can find $Q^*$ by dynamic programming (e.g. value iteration).
Otherwise, it needs to estimate $Q^*$ from its experience.
The experience of an agent is a sequence $s_0, a_0, r_0, s_1, a_1, r_1, s_2, \dots$
The $n$-th episode is $(s_n, a_n, r_n, s_{n+1})$.
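
For the model-known case, one concrete dynamic-programming option is Q-value iteration (a sketch under the same toy-MDP dictionary conventions as above; the slides mention value iteration generically rather than this exact variant):

def q_value_iteration(S, A, p, r, gamma=0.9, tol=1e-8, max_iters=10_000):
    """Repeatedly apply the Bellman optimality backup until Q stops changing."""
    Q = {(s, a): 0.0 for s in S for a in A}
    for _ in range(max_iters):
        delta = 0.0
        for s in S:
            for a in A:
                # B[Q](s, a) = r(s, a) + gamma * sum_s' p(s'|s, a) * max_a' Q(s', a')
                backup = r[(s, a)] + gamma * sum(
                    prob * max(Q[(s2, a2)] for a2 in A)
                    for s2, prob in p[(s, a)].items()
                )
                delta = max(delta, abs(backup - Q[(s, a)]))
                Q[(s, a)] = backup
        if delta < tol:   # stop once the largest change is below tolerance
            break
    return Q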

28 Learning Q* (Q-Learning: Algorithm)
Q-Learning:
Initialize $Q_0(s, a)$ for all $a \in A$ and $s \in S$.
For each episode $n$:
  observe the current state $s_n$
  select and execute an action $a_n$
  observe the next state $s_{n+1}$ and reward $r_n$
  update $Q_n(s_n, a_n) \leftarrow (1 - \alpha_n)\, Q_{n-1}(s_n, a_n) + \alpha_n \left(r_n + \gamma \max_a Q_{n-1}(s_{n+1}, a)\right)$
$\alpha_n \in (0, 1)$ is the learning rate.
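
The update rule translates almost line-for-line into code. The sketch below is an illustration, not the paper's pseudocode: it assumes a hypothetical environment object with reset() and step(a) returning (next_state, reward, done), a fixed learning rate, and a pluggable behaviour policy (for instance the $\epsilon$-greedy rule shown a few slides below).

import random

def q_learning(env, states, actions, gamma=0.99, alpha=0.1,
               n_steps=100_000, behaviour_policy=None):
    """Tabular Q-learning: Q(s,a) <- (1-alpha) Q(s,a) + alpha (r + gamma max_a' Q(s',a'))."""
    Q = {(s, a): 0.0 for s in states for a in actions}  # initialize Q_0
    s = env.reset()                                     # observe the current state
    for _ in range(n_steps):
        # select and execute an action (purely random if no behaviour policy is given)
        a = random.choice(actions) if behaviour_policy is None else behaviour_policy(Q, s, actions)
        s_next, r, done = env.step(a)                   # observe the next state and reward
        target = r + gamma * max(Q[(s_next, a2)] for a2 in actions)
        Q[(s, a)] = (1 - alpha) * Q[(s, a)] + alpha * target
        s = env.reset() if done else s_next             # start a new trajectory at episode end
    return Q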

29-32 Exploration vs. Exploitation (Q-Learning: Algorithm)
How should the agent gain experience in order to learn?
If it explores too much, it might suffer from a low value.
If it exploits the learned Q function too early, it might get stuck at bad states without even knowing about better possibilities (overfitting).
One common approach is to use an $\epsilon$-greedy policy.
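
A minimal sketch of the $\epsilon$-greedy rule (the specific value of $\epsilon$ is an illustrative assumption, not taken from the slides): with probability $\epsilon$ act uniformly at random, otherwise act greedily with respect to the current Q estimate.

import random

def epsilon_greedy(Q, s, actions, epsilon=0.1):
    """Explore with probability epsilon, otherwise exploit argmax_a Q(s, a)."""
    if random.random() < epsilon:
        return random.choice(actions)                # exploration
    return max(actions, key=lambda a: Q[(s, a)])     # exploitation

A function like this can be passed as the behaviour_policy in the Q-learning sketch above; in practice $\epsilon$ is often decayed over time so that early behaviour is exploratory and later behaviour is mostly greedy.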

33 Convergence of the Q function (Q-Learning: Convergence analysis)
Theorem: Given bounded rewards $|r_n| \le R$ and learning rates $0 \le \alpha_n < 1$ such that
$$\sum_{n=1}^{\infty} \alpha_n = \infty \quad \text{and} \quad \sum_{n=1}^{\infty} \alpha_n^2 < \infty,$$
then with probability 1,
$$Q_n(s, a) \xrightarrow[n \to \infty]{} Q^*(s, a).$$
The exploration policy should be such that each state-action pair is encountered infinitely many times.
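
A standard example that satisfies both conditions (not spelled out on the slide) is a learning rate decaying like one over the number of updates applied to the pair $(s, a)$:
$$\alpha_n = \frac{1}{n}, \qquad \sum_{n=1}^{\infty} \frac{1}{n} = \infty, \qquad \sum_{n=1}^{\infty} \frac{1}{n^2} = \frac{\pi^2}{6} < \infty.$$
A constant learning rate violates the second condition, while a rate decaying like $1/n^2$ violates the first; and since $n$ here counts the updates of that particular state-action pair, every pair must indeed be visited infinitely often for the conditions to hold.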

34 Convergence of the Q function (Q-Learning: Convergence analysis)
Proof - main idea:
Define the operator
$$B[Q](s, a) = r(s, a) + \gamma \sum_{s'} p(s' \mid s, a) \max_{a'} Q(s', a').$$
$B$ is a contraction under the sup norm: $\|B[Q_1] - B[Q_2]\|_\infty \le \gamma \|Q_1 - Q_2\|_\infty$.
It is easy to see that $B[Q^*] = Q^*$.
We are not done, because the updates come from a stochastic process.
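
The contraction step can be written out in one line (a standard argument added here for completeness, not a transcription of the paper's proof):
$$\left|B[Q_1](s, a) - B[Q_2](s, a)\right| = \gamma \left|\sum_{s'} p(s' \mid s, a)\left(\max_{a'} Q_1(s', a') - \max_{a'} Q_2(s', a')\right)\right| \le \gamma \sum_{s'} p(s' \mid s, a) \max_{a'} \left|Q_1(s', a') - Q_2(s', a')\right| \le \gamma \, \|Q_1 - Q_2\|_\infty,$$
using $|\max_x f(x) - \max_x g(x)| \le \max_x |f(x) - g(x)|$. Taking the maximum over $(s, a)$ on the left gives the stated bound, so by the Banach fixed-point theorem $B$ has a unique fixed point, which is $Q^*$. The remaining work in the proof is to show that the noisy, sampled updates of Q-learning still track this fixed point.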

35 Outline (section: Assumptions and Limitations)

36 Assumptions and Limitations
What assumptions did we make?
Q-learning is model-free.
The state and reward are assumed to be fully observable - a POMDP is a much harder problem.
The Q function can be represented by a look-up table; otherwise we need to do something else, such as function approximation, and this is where deep learning comes in!
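
To illustrate the last point, here is a hedged sketch of Q-learning with a linear function approximator in place of the look-up table (a deliberately simple stand-in for the deep networks the seminar turns to next; the feature map phi is a placeholder the reader must supply, and the tabular convergence guarantee no longer applies in general):

import numpy as np

def approx_q_learning_step(w, phi, s, a, r, s_next, actions, gamma=0.99, alpha=0.01):
    """One semi-gradient Q-learning update for Q_w(s, a) = w . phi(s, a)."""
    q_sa = w @ phi(s, a)
    # Bootstrapped target, as in the tabular rule, but computed from Q_w
    target = r + gamma * max(w @ phi(s_next, a2) for a2 in actions)
    td_error = target - q_sa
    # The gradient of Q_w(s, a) with respect to w is simply phi(s, a)
    return w + alpha * td_error * phi(s, a)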

37 Outline (section: Summary)

38 Summary
The basic reinforcement learning problem can be modeled as an MDP.
If the model is known, we can solve it precisely by dynamic programming.
Q-learning allows learning an optimal policy when the model is not known, and without trying to learn the model.
It converges asymptotically to the optimal $Q^*$ if the learning rate is not too fast and not too slow, and if the exploration policy is satisfactory.

39 Further Reading (Summary)
Richard S. Sutton and Andrew G. Barto. Reinforcement Learning: An Introduction. MIT Press, 1998.
