Reinforcement Learning. Introduction

1 Reinforcement Learning Introduction

2 Reinforcement Learning
An agent interacts with, and learns from, a stochastic environment
The science of sequential decision making
Many faces of reinforcement learning:
  Optimal control (Engineering)
  Dynamic programming (Operations Research)
  Reward systems (Neuroscience)
  Classical/operant conditioning (Psychology)

3 Characteristics of Reinforcement Learning
No supervisor, only reward signals
Feedback is delayed
Decisions are sequential
Actions affect subsequent observations (the data are not i.i.d.)

4 Examples
Automated vehicle control: an unmanned helicopter learning to fly and perform stunts
Game playing: backgammon, Atari Breakout, Tetris, Tic-Tac-Toe
Medical treatment planning: planning a sequence of treatments based on the effect of past treatments
Chatbots: an agent figuring out how to hold a conversation

5 Markov Decision Process (MDP)

6 Markov Decision Processes (MDP)
Sequential decisions in rounds $t = 1, \dots, T$
Important concepts: state, action, reward
Markov property: the future is independent of the past given the current state

7 Markov Decision Processes (MDP)
Starts at some initial state $s_1$
In every round $t$, the agent observes the current state $s_t$, takes an action $a_t$, observes a reward signal $r_t$, and then transitions to the next state $s_{t+1}$
Markov property:
$\Pr(s_{t+1} = s' \mid \text{history till time } t) = \Pr(s_{t+1} = s' \mid s_t = s, a_t = a) =: P_{s,a}(s')$
$E[r_t \mid \text{history till time } t] = E[r_t \mid s_t = s, a_t = a] =: R_{s,a}$
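The interaction loop on this slide can be sketched in Python as follows. This is a minimal illustration, not code from the lecture: the two-state tables `P` and `R`, the state and action names, and the helper `step` are all hypothetical, and the reward is assumed to be deterministic and equal to its mean $R_{s,a}$.

```python
import random

# Hypothetical two-state, two-action MDP (illustrative numbers only).
# P[s][a] maps each next state s' to Pr(s_{t+1} = s' | s_t = s, a_t = a).
P = {
    "A": {"stay": {"A": 0.9, "B": 0.1}, "go": {"A": 0.2, "B": 0.8}},
    "B": {"stay": {"A": 0.0, "B": 1.0}, "go": {"A": 0.7, "B": 0.3}},
}
# R[s][a] is the expected reward E[r_t | s_t = s, a_t = a].
R = {"A": {"stay": 0.0, "go": -1.0}, "B": {"stay": 1.0, "go": 0.0}}


def step(state, action, rng):
    """Sample (r_t, s_{t+1}) using only the current state and action."""
    next_states = list(P[state][action])
    weights = [P[state][action][s] for s in next_states]
    next_state = rng.choices(next_states, weights=weights, k=1)[0]
    return R[state][action], next_state


rng = random.Random(0)
state = "A"  # initial state s_1
trajectory = []
for t in range(1, 6):
    action = rng.choice(["stay", "go"])  # placeholder for the agent's decision
    reward, next_state = step(state, action, rng)
    trajectory.append((t, state, action, reward, next_state))
    state = next_state
```

Because `step` never looks at `trajectory`, the next state and reward depend on the history only through the current state and action, which is the Markov property stated above.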

8 Markov Decision Processes (MDP)
Goal: maximize some form of cumulative reward
Total reward in finite time $T$: maximize $\sum_{t=1}^{T} r_t$
Infinite-time average reward: maximize $\lim_{T \to \infty} \frac{1}{T} \sum_{t=1}^{T} r_t$
Discounted sum of rewards: maximize $r_1 + \gamma r_2 + \gamma^2 r_3 + \dots + \gamma^{i-1} r_i + \dots$, where $\gamma < 1$
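As a small worked example of the three objectives, the sketch below evaluates each on a short reward sequence. The function names and the reward list are made up for illustration, and the average is taken over a finite horizon, so it only approximates the limit in the infinite-time definition.

```python
def total_reward(rewards):
    """Sum of r_1 ... r_T."""
    return sum(rewards)


def average_reward(rewards):
    """Finite-horizon average (1/T) * sum of r_t."""
    return sum(rewards) / len(rewards)


def discounted_return(rewards, gamma):
    """r_1 + gamma * r_2 + gamma^2 * r_3 + ...; rewards[0] is r_1."""
    return sum(gamma ** i * r for i, r in enumerate(rewards))


rewards = [1.0, 1.0, 2.0, -1.0]        # hypothetical r_1 .. r_4
total = total_reward(rewards)          # 3.0
average = average_reward(rewards)      # 0.75
discounted = discounted_return(rewards, 0.5)  # 1 + 0.5 + 0.5 - 0.125 = 1.875
```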

9 Summary
A Markov Decision Process (MDP) is a tuple $(S, s_1, A, P, R)$
$S$ is a finite set of states
$A$ is a finite set of actions
$P$ is a state transition probability matrix of dimension $|S| \times |A| \times |S|$: $P_{s,a}(s') = \Pr(s_{t+1} = s' \mid s_t = s, a_t = a)$
$R$ is a reward function: $R_{s,a} = E[r_t \mid s_t = s, a_t = a]$
Goal definition, with discount factor $\gamma \in [0,1)$

10 Example
[Figure: example MDP diagram; the labels are not recoverable from the extracted text]

11 Markov Decision Processes Finding an optimal policy: Value functions

12 Overview
A Markov Decision Process is a tuple $(S, s_1, A, P, R)$
$P$ is a state transition probability matrix of dimension $|S| \times |A| \times |S|$: $P_{s,a}(s') = \Pr(s_{t+1} = s' \mid s_t = s, a_t = a)$
$R$ is a reward function: $R_{s,a} = E[r_t \mid s_t = s, a_t = a]$
Goal: maximize the expected discounted reward $E\left[\sum_{t=1}^{\infty} \gamma^{t-1} r_t \mid s_1\right]$, where $E[r_t \mid s_t, a_t] = R_{s_t,a_t}$ and $\gamma \in [0,1)$ is a discount factor

13 Policy
A policy $\pi: S \to A$ is a mapping from the state space to the action space
Following a stationary policy $\pi$ means taking action $a_t = \pi(s_t)$ at all time steps $t$
Theorem: for any discounted MDP, there always exists a stationary policy $\pi$ that is optimal

14 Value function
The value function $v_\pi(s)$ of a policy $\pi$ is the expected discounted reward when starting from state $s$ and then following $\pi$:
$v_\pi(s) = E\left[\sum_{t=1}^{\infty} \gamma^{t-1} r_t \mid s_1 = s\right]$
where $a_t = \pi(s_t)$, $E[r_t \mid s_t, a_t] = R_{s_t,a_t}$, and $\Pr(\cdot \mid s_t, a_t) = P_{s_t,a_t}$

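15 Example
[Figure: three-state MDP with states Standing, Moving, and Fallen; edges are labeled (probability, reward), but the labels are garbled in extraction]
Policy: slow action (action 1, black) in the Fallen state; fast action (action 2, green) in the Standing and Moving states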

16 Bellman equations
The value function can be decomposed into the immediate reward plus the discounted value of the next state:
$v_\pi(s) = R_{s,\pi(s)} + \gamma \sum_{s'} P_{s,\pi(s)}(s') \, v_\pi(s')$
Compact matrix notation:
$v_\pi = r_\pi + \gamma P_\pi v_\pi$, so $v_\pi = (I - \gamma P_\pi)^{-1} r_\pi$
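The matrix form of the Bellman equation can be checked numerically. The sketch below solves $(I - \gamma P_\pi) v_\pi = r_\pi$ with a small hand-written Gauss-Jordan routine so that it needs no external libraries; the helper names `solve` and `evaluate_policy`, and the two-state chain with its numbers, are assumptions made for illustration and do not come from the slides.

```python
def solve(A, b):
    """Solve A x = b by Gauss-Jordan elimination with partial pivoting."""
    n = len(b)
    M = [list(A[i]) + [b[i]] for i in range(n)]  # augmented matrix [A | b]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(n):
            if r != col:
                factor = M[r][col] / M[col][col]
                M[r] = [x - factor * y for x, y in zip(M[r], M[col])]
    return [M[i][n] / M[i][i] for i in range(n)]


def evaluate_policy(P_pi, r_pi, gamma):
    """Return v_pi satisfying (I - gamma * P_pi) v_pi = r_pi."""
    n = len(r_pi)
    A = [[(1.0 if i == j else 0.0) - gamma * P_pi[i][j] for j in range(n)]
         for i in range(n)]
    return solve(A, r_pi)


# Hypothetical chain induced by some fixed policy (illustrative numbers only).
P_pi = [[0.5, 0.5],
        [0.0, 1.0]]
r_pi = [0.0, 1.0]
gamma = 0.5
v_pi = evaluate_policy(P_pi, r_pi, gamma)  # approximately [2/3, 2]

# Residual of v_pi = r_pi + gamma * P_pi v_pi, which should be near zero.
residual = [v_pi[i] - (r_pi[i] + gamma * sum(P_pi[i][j] * v_pi[j] for j in range(2)))
            for i in range(2)]
```

For this chain the second state is absorbing with reward 1, so $v_\pi = 1/(1-\gamma) = 2$ there, and the first state satisfies $v = 0.5\,(0.5\,v + 0.5 \cdot 2)$, giving $v = 2/3$.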

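17 Example
[Figure: three-state MDP with states Standing, Moving, and Fallen; edges are labeled (probability, reward), but the labels are garbled in extraction]
Policy: slow action (action 1, black) in the Fallen state; fast action (action 2, green) in the Standing and Moving states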

18 Markov Decision Processes Finding an optimal policy: Iterative methods

19 Recap
Value function $v_\pi(s)$ of a policy $\pi$:
$v_\pi(s) = E[r_1 + \gamma r_2 + \gamma^2 r_3 + \dots \mid s_1 = s]$
Bellman equations:
$v_\pi = r_\pi + \gamma P_\pi v_\pi$, so $v_\pi = (I - \gamma P_\pi)^{-1} r_\pi$

20 Optimal Policy
Optimal policy when starting in state $s$: $\arg\max_\pi v_\pi(s)$

21 Optimal Policy
Define a partial ordering over policies: $\pi \ge \pi'$ if $v_\pi(s) \ge v_{\pi'}(s)$ for all $s$
Theorem: there always exists a policy $\pi^*$ that is at least as good as every other policy, that is, $\pi^* \ge \pi$ for all $\pi$
Such a policy is called an optimal policy
All optimal policies achieve the same value function $v^*(s)$, called the optimal value function

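22 Bellman Optimality Equations
Optimal value functions are recursively related by the Bellman optimality equations:
$v^*(s) = \max_{a \in A} \left[ R_{s,a} + \gamma \sum_{s'} P_{s,a}(s') \, v^*(s') \right]$
Matrix notation: $v^* = \max_\pi \left( r_\pi + \gamma P_\pi v^* \right)$, with the maximum taken componentwise
An optimal policy can be computed by solving the Bellman optimality equations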

23 Solving the Bellman optimality equations
No closed-form solution in general
Iterative solution methods:
  Policy iteration
  Value iteration

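24 Policy Iteration
Start with a random policy $\pi$
In every iteration:
  Evaluate the policy: compute the value vector $v_\pi = (I - \gamma P_\pi)^{-1} r_\pi$
  Improve the policy: new policy $\pi'(s) = \arg\max_a \left[ R_{s,a} + \gamma P_{s,a} v_\pi \right]$
  Stop if there is no strict improvement ($v_{\pi'} = v_\pi$); then $v_\pi(s) = \max_a \left[ R_{s,a} + \gamma P_{s,a} v_\pi \right]$ for all $s$
Here $P_{s,a} v_\pi$ denotes $\sum_{s'} P_{s,a}(s') \, v_\pi(s')$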
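The evaluate/improve loop can be sketched as below. Two choices here are assumptions rather than the lecture's method: policy evaluation is done by repeatedly applying $v \leftarrow r_\pi + \gamma P_\pi v$ instead of inverting $I - \gamma P_\pi$, and the loop stops when the improved policy equals the current one. The two-state deterministic MDP and all function names are hypothetical.

```python
def evaluate(policy, P, R, gamma, sweeps=200):
    """Approximate v_pi by iterating v <- r_pi + gamma * P_pi v."""
    n = len(P)
    v = [0.0] * n
    for _ in range(sweeps):
        v = [R[s][policy[s]]
             + gamma * sum(P[s][policy[s]][s2] * v[s2] for s2 in range(n))
             for s in range(n)]
    return v


def q_value(s, a, v, P, R, gamma):
    """R_{s,a} + gamma * P_{s,a} v."""
    return R[s][a] + gamma * sum(p * v[s2] for s2, p in enumerate(P[s][a]))


def policy_iteration(P, R, gamma):
    n_states, n_actions = len(P), len(P[0])
    policy = [0] * n_states  # arbitrary starting policy
    while True:
        v = evaluate(policy, P, R, gamma)
        new_policy = [max(range(n_actions),
                          key=lambda a: q_value(s, a, v, P, R, gamma))
                      for s in range(n_states)]
        if new_policy == policy:  # improvement step changed nothing: stop
            return policy, v
        policy = new_policy


# Hypothetical MDP: P[s][a][s'] and R[s][a]; action 0 = "stay", action 1 = "move".
P = [[[1.0, 0.0], [0.0, 1.0]],
     [[0.0, 1.0], [1.0, 0.0]]]
R = [[0.0, 0.0],
     [1.0, 0.0]]
best_policy, best_v = policy_iteration(P, R, gamma=0.5)
```

In this toy problem only staying in state 1 pays a reward, so the loop ends with "move" in state 0 and "stay" in state 1, with values close to 1 and 2.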

25 Example: Iteration 1
[Figure: Standing/Moving/Fallen MDP diagram; edge labels garbled in extraction]
Starting policy: always slow action; $\gamma = 0.1$
Evaluate: $v_\pi = (I - \gamma P_\pi)^{-1} r_\pi$ (the numeric entries of $r_\pi$, $P_\pi$, and $v_\pi$ did not survive extraction)
Improve policy: compute $\arg\max_a \left[ R_{s,a} + \gamma P_{s,a} v_\pi \right]$
State Standing: slow action $= 1.11$, fast action $= 0.86$

26 Example: Iteration 1 (continued)
Starting policy: always slow action
Improve policy: compute $\arg\max_a \left[ R_{s,a} + \gamma P_{s,a} v_\pi \right]$
State Standing: slow action
State Moving: slow action versus fast action (values did not survive extraction)

27 Example: Iteration 1 (result)
Starting policy: always slow action
State Standing: SLOW action
State Moving: FAST action

28 Example: Iteration 2
New policy: fast action in the Moving state, slow elsewhere
Evaluate: $v_\pi = (I - \gamma P_\pi)^{-1} r_\pi$ (numeric entries did not survive extraction)
Improve policy: compute $\arg\max_a \left[ R_{s,a} + \gamma P_{s,a} v_\pi \right]$
State Standing: slow action versus fast action (values did not survive extraction)

29 Example: Iteration 2 (continued)
New policy: fast action in the Moving state, slow elsewhere
State Standing: SLOW action
State Moving: fast action versus slow action (values did not survive extraction)

30 Example: Iteration 2 (result)
New policy: fast action in the Moving state, slow elsewhere
State Standing: SLOW action
State Moving: FAST action
The new policy is the same as the old policy: STOP!
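The numbers that survive in this example can be reproduced with a short script, under one important assumption: the transition table `T` below is a reconstruction from the fragments of the figure (pairs such as "0.4, +1", "0.6, -1", "0.8, +2"), not a table taken from the slides. It is used here only because it reproduces the quoted values 1.11 and 0.86 for the Standing state with $\gamma = 0.1$. Policy evaluation uses fixed-point iteration rather than the matrix inverse.

```python
GAMMA = 0.1
STATES = ["Fallen", "Standing", "Moving"]

# ASSUMED model, reconstructed from figure fragments:
# T[state][action] = list of (probability, next_state, reward).
T = {
    "Fallen":   {"slow": [(0.4, "Standing", 1.0), (0.6, "Fallen", -1.0)]},
    "Standing": {"slow": [(1.0, "Moving", 1.0)],
                 "fast": [(0.6, "Moving", 2.0), (0.4, "Fallen", -1.0)]},
    "Moving":   {"slow": [(1.0, "Moving", 1.0)],
                 "fast": [(0.8, "Moving", 2.0), (0.2, "Fallen", -1.0)]},
}


def q_value(state, action, v):
    """Expected reward plus discounted next-state value."""
    return sum(p * (r + GAMMA * v[s2]) for p, s2, r in T[state][action])


def evaluate(policy, sweeps=100):
    """Approximate v_pi by iterating v <- r_pi + gamma * P_pi v."""
    v = {s: 0.0 for s in STATES}
    for _ in range(sweeps):
        v = {s: q_value(s, policy[s], v) for s in STATES}
    return v


def improve(v):
    """Greedy policy with respect to v."""
    return {s: max(T[s], key=lambda a: q_value(s, a, v)) for s in STATES}


policy_0 = {s: "slow" for s in STATES}       # starting policy: always slow
v_0 = evaluate(policy_0)
q_slow = q_value("Standing", "slow", v_0)    # about 1.11
q_fast = q_value("Standing", "fast", v_0)    # about 0.86
policy_1 = improve(v_0)                      # fast in Moving, slow elsewhere
policy_2 = improve(evaluate(policy_1))       # unchanged, so iteration stops
```

Under the assumed model, the second improvement step returns the same policy as the first, matching the stopping point of the example.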

31 Value Iteration method
Finds the optimal value function directly; no explicit policy is maintained
In every iteration $k$, improve the value vector:
$v^{(k+1)}(s) = \max_a \left[ R_{s,a} + \gamma P_{s,a} v^{(k)} \right]$
Converges to $v^*$: $v^{(k)} \to v^*$
Optimal policy given by $\arg\max_a \left[ R_{s,a} + \gamma P_{s,a} v^* \right]$
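A minimal sketch of this update rule follows. The stopping tolerance `tol`, the function names, and the two-state deterministic MDP are assumptions added for illustration; the slides only state that the iterates converge to $v^*$.

```python
def value_iteration(P, R, gamma, tol=1e-10):
    """Iterate v(s) <- max_a [R[s][a] + gamma * P[s][a] . v] until changes fall below tol."""
    n_states, n_actions = len(P), len(P[0])

    def backup(s, a, v):
        return R[s][a] + gamma * sum(p * v[s2] for s2, p in enumerate(P[s][a]))

    v = [0.0] * n_states
    while True:
        new_v = [max(backup(s, a, v) for a in range(n_actions))
                 for s in range(n_states)]
        done = max(abs(x - y) for x, y in zip(new_v, v)) < tol
        v = new_v
        if done:
            break
    # The policy is read off only at the end, as the greedy choice under v.
    policy = [max(range(n_actions), key=lambda a: backup(s, a, v))
              for s in range(n_states)]
    return v, policy


# Hypothetical MDP: P[s][a][s'] and R[s][a]; action 0 = "stay", action 1 = "move".
P = [[[1.0, 0.0], [0.0, 1.0]],
     [[0.0, 1.0], [1.0, 0.0]]]
R = [[0.0, 0.0],
     [1.0, 0.0]]
v_star, pi_star = value_iteration(P, R, gamma=0.9)
```

With $\gamma = 0.9$, staying in state 1 is worth $1/(1-0.9) = 10$, and moving there from state 0 is worth $0.9 \cdot 10 = 9$, so the result should be close to `[9, 10]` with policy `[1, 0]`.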

32 Reinforcement Learning Algorithms

33 Model-free methods
Reinforcement learning: an MDP with unknown transition model and/or reward distribution
The model is unknown, but the agent observes samples
Learn while optimizing the policy

34 Formulation
Starts at some initial state $s_1$
In every round $t$, the agent observes the current state $s_t$, takes an action $a_t$, and then observes a reward signal $r_t$ and the next state $s_{t+1}$
$E[r_t \mid s_t = s, a_t = a] = R_{s,a}$
$\Pr(s_{t+1} = s' \mid s_t = s, a_t = a) = P_{s,a}(s')$
$\{R_{s,a}, P_{s,a}\}$ are unknown

35 Goal
Find the optimal policy: the policy that maximizes the expected sum of discounted rewards
$\{R_{s,a}, P_{s,a}\}$ are unknown

36 Q-learning
Uses Q-values instead of a value function
$Q(s, a)$: the value of taking action $a$ in state $s$
Formally, $Q(s, a) = R_{s,a} + \gamma \, E_{s'}\left[\max_{a'} Q(s', a')\right]$: the immediate expected reward plus the best utility from the next state onwards
From the Bellman optimality equations, an optimal policy $\pi^*$ satisfies
$Q(s, \pi^*(s)) = R_{s,\pi^*(s)} + \gamma \, E_{s'}\left[Q(s', \pi^*(s'))\right] = v^*(s)$

37 Q-learning
Proceeds in discrete rounds $t = 1, 2, \dots$
In every round $t$:
  Choose an action greedily using the estimated Q-values: $a_t = \arg\max_a Q(s_t, a)$
  Take action $a_t$; observe reward $r_t$ and next state $s_{t+1}$
  Update the Q-value for $(s_t, a_t)$: $Q(s_t, a_t) = r_t + \gamma \max_a Q(s_{t+1}, a)$
(Compare to $Q(s, a) = R_{s,a} + \gamma \, E_{s'}\left[\max_{a'} Q(s', a')\right]$)
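The rounds described on this slide can be sketched as a tabular loop. The environment tables `NEXT` and `REWARD`, the function names, and the step-size parameter `alpha` are assumptions for illustration; `alpha = 1.0` reproduces the overwrite-style update on the slide, while smaller values (not in the slides) would average over noisy samples.

```python
def greedy_action(Q, s):
    """argmax_a Q[s][a], ties broken toward the lowest action index."""
    return max(range(len(Q[s])), key=lambda a: Q[s][a])


def q_update(Q, s, a, r, s_next, gamma, alpha=1.0):
    """Move Q[s][a] toward the sample target r + gamma * max_a' Q[s_next][a']."""
    target = r + gamma * max(Q[s_next])
    Q[s][a] += alpha * (target - Q[s][a])


# Hypothetical deterministic environment: next state and reward per (state, action).
# Action 0 = "stay", action 1 = "move".
NEXT = [[0, 1], [1, 0]]
REWARD = [[0.0, 0.0], [1.0, 0.0]]
gamma = 0.5

Q = [[0.0, 0.0], [0.0, 0.0]]  # initial Q-value estimates
s = 1                          # initial state s_1
for t in range(100):
    a = greedy_action(Q, s)
    r, s_next = REWARD[s][a], NEXT[s][a]
    q_update(Q, s, a, r, s_next, gamma)
    s = s_next
```

Starting in state 1, the greedy agent keeps choosing "stay", and `Q[1][0]` approaches $1/(1-\gamma) = 2$. Every other entry stays at its initial value of zero because a purely greedy agent never tries those state-action pairs.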

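38 The need for Exploration
[Figure: two-state MDP with states S1 and S2; edges are labeled (probability, reward), with rewards +1 and +100 among the surviving labels; the remaining labels are garbled in extraction]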

39 Epsilon-greedy exploration
With probability $1 - \varepsilon$, use the greedy action $a_t = \arg\max_a Q(s_t, a)$
With probability $\varepsilon$, play a random action
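A minimal sketch of epsilon-greedy action selection inside a Q-learning loop follows. The environment tables, the seed, the number of rounds, and the value `epsilon = 0.2` are hypothetical choices for illustration. The random action is drawn uniformly from all actions, including the greedy one, which is one common reading of "play random action".

```python
import random


def epsilon_greedy(Q, s, epsilon, rng):
    """With probability epsilon pick a uniform random action, else argmax_a Q[s][a]."""
    n_actions = len(Q[s])
    if rng.random() < epsilon:
        return rng.randrange(n_actions)
    return max(range(n_actions), key=lambda a: Q[s][a])


# Hypothetical deterministic environment: next state and reward per (state, action).
# Action 0 = "stay", action 1 = "move"; only staying in state 1 pays a reward.
NEXT = [[0, 1], [1, 0]]
REWARD = [[0.0, 0.0], [1.0, 0.0]]
gamma, epsilon = 0.5, 0.2

rng = random.Random(0)
Q = [[0.0, 0.0], [0.0, 0.0]]
s = 0  # start in the state where the greedy choice alone would never move
for t in range(5000):
    a = epsilon_greedy(Q, s, epsilon, rng)
    r, s_next = REWARD[s][a], NEXT[s][a]
    Q[s][a] = r + gamma * max(Q[s_next])  # overwrite-style Q-learning update
    s = s_next

greedy_policy = [max(range(2), key=lambda a: Q[st][a]) for st in range(2)]
```

With all-zero initial estimates, a purely greedy agent starting in state 0 would choose "stay" forever and never see the reward in state 1. The occasional random action lets the agent reach state 1, after which the learned greedy policy is to move from state 0 and stay in state 1.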
