Lecture 18: Reinforcement Learning Sanjeev Arora Elad Hazan

Size: px

Start display at page:

Download "Lecture 18: Reinforcement Learning Sanjeev Arora Elad Hazan"

Neal Hensley
5 years ago
Views:

1 COS 402 Machine Learning and Artificial Intelligence Fall 2016 Lecture 18: Reinforcement Learning Sanjeev Arora Elad Hazan Some slides borrowed from Peter Bodik and David Silver

2 Course progress Learning from examples Definition + fundamental theorem of statistical learning, motivated efficient algorithms/optimization Convexity, greedy optimization gradient descent Neural networks Knowledge Representation NLP Logic Bayes nets Optimization: MCMC HMM Today: reinforcement learning part 1

3 Admin (programming) exercise MCMC announced today Due in 1 week in class, as usual

4 Decisions and planning Thus far: Learning from examples Knowledge representation / language inference/prediction Missing: actions/decisions Learn from interaction RL: no supervisor, only a reward signal Feedback is delayed Time really matters (sequential, non i.i.d data) Agent s actions affect the subsequent data it receives

5 RL - examples Fly stunt maneuvers in a helicopter Defeat the world champion at Backgammon (& Go) Control a power station Make a humanoid robot walk Play Atari games better than humans

6 Reward hypothesis Agent goal: maximize cumulative reward Hypothesis: All goals can be described by the maximization of expected cumulative reward (?) Examples: Fly stunt maneuvers in a helicopter: +ve reward for following desired trajectory ve reward for crashing Backgammon: +/ ve reward for winning/losing a game Make a humanoid robot walk: +ve reward for forward motion ve reward for falling over Play many different Atari games: +/ ve reward for increasing/decreasing score

Sequential decision making Agent takes action Nature responds with reward Agent sees observation Agent has internal state (from all previous observations) s " = f H ", H " = {o

7 Sequential decision making Agent takes action Nature responds with reward Agent sees observation Agent has internal state (from all previous observations) s " = f H ", H " = {o *, r *, a *,, o ".*, r ".*, a ".*, o ", r " } Markovian assumption: state, observation, reward are independent on past given current state Pr[ s " s ".* ] = Pr[ s " s *,, s ".* ]

8 Markovian? State? Actions? Rewards?

9 Robot in a room +1-1 actions: UP, DOWN, LEFT, RIGHT UP 80% move UP 10% move LEFT 10% move RIGHT START reward +1 at [4,3], -1 at [4,2] reward for each step states actions rewards what is the solution?

10 Is this a solution? +1-1 only if actions deterministic not in this case (actions are stochastic) solution/policy mapping from each state to an action

11 Optimal policy +1-1

12 Reward for each step

13 Reward for each step:

14 Reward for each step:

15 Reward for each step:

16 Reward for each step:

17 Formal model: Markov Decision Process States: Markov Process (chain) Rewards: Markov Reward Process Decisions: Markov Decision Process

18 Markov Process: the student chain FB C P S facebook sleep FB 0.9 C 0.4 Class P 0.4 S 1 Party

19 Example: the student chain 0.9 facebook sleep 1 Example of episodes (random walks): C C Fb Fb C P S Class C Fb Fb Fb Fb C P S 0.4 Party

20 Markov Reward Process: the student REWARD chain FB C P S Facebook R=-1 Sleep R=0 FB 0.9 C 0.4 Class R=2 P 0.4 S 1 Party R=1

21 Example: the student REWARD chain Markov Reward Process, definition: Tuple (S, P, R, γ) where S = states, including start state P = transition matrix P ;;< = Pr[S "=* = s S " = s] R = reward function, R ; = E[R "=* S " = s] γ [0,1] = discount factor 0.9 Facebook R=-1 Class R=2 Slee p R=0 0.8 Return Exponentially diminishing returns G " = D R "=E γ E.* EF* "G H 0.4 Party R=1 why? γ = 0? γ = 1? With discount factor < 1 à G t always well defined, regardless of stationarity

22 Example: the student REWARD chain 0.9 Facebook R=-1 Class R=2 0.4 Sleep R=0 0.8 Example of episode (random walks): discount factor = ½ C C Fb Fb S total reward = G * = M + 1 P + 0 = = = Party R=1 C Fb Fb Fb Fb C P S G 1 =?

23 The Value function Mapping from states to real numbers: v s = E[G " S " = s " ] 0.9 Facebook R=-1 Slee p R=0 0.8 Class R=2 0.4 Party R=1

24 Value function, γ = Facebook R=-1 V=-1 Sleep R=0 V=0 0.8 FB C FB C P S Class R=2 V=2 P 0.4 S 1 Party R=1 V=1 v s = E[G " S " = s " ] G " = D R "=E γ E.* EF* "G H

25 Computing the value function How can we compute it? 0.9 Facebook R=-1 Slee p R=0 0.8 v s = E[G " S " = s " ] Class R=2 0.4 Party R=1

26 The Bellman equation for MRP v s = R ; + γ D P ;; Vv(s < ) ; 0.9 Facebook R=-1 Slee p R=0 0.8 R ; = E R "=* S " = s Class R=2 0.4 Party R=1 P ;;< = transition probability from s to s

27 The Bellman equation for MRP v s = E G " S " = s = E D γ E.* R "=E EF* "G H S " = s 0.9 Facebook R=-1 Slee p R=0 0.8 R ; = E R "=* S " = s = E R "=* + γ D γ E.* R "=*=E EF* "G H S " = s Class R=2 0.4 = E R "=* + γg "=* S " = s Party R=1 P ;;< = transition probability from s to s = E R "=* + γ v(s "=* ) S " = s = R ; + γ D P ;; V v(s < ) ;

28 Bellman equation in matrix form How can we compute it? v s = R ; + γ D P ;; Vv(s < ) v = R + γ P v For v being the vector of values v(s), R being vector in same space of R s, s S, and P being the transition matrix. Thus, v = I γp.* R System of linear equations (Gaussian elimination, cubic time) ;

29 Markov Decision Process Markov Reward Process, definition: Tuple (S, P, R, A, γ) where S = states, including start state A = set of possible actions P = transition matrix P Z ;;< = Pr[S "=* = s S " = s, A " = a] R = reward function, R Z ; = E[R "=* S " = s, A " = a] γ [0,1] = discount factor Return G " = D R "=E γ E.* EF* "G H Goal: take actions to maximize expected return

30 Policies The Markovian structure è best action depends only on current state! Policy = mapping from state to distribution over actions π: S Δ(A), π a s = Pr[A " = a S " = s] Given a policy, the MDP reduces to a Markov Reward Process

31 Reminder: MDP actions: UP, DOWN, LEFT, RIGHT UP 80% move UP 10% move LEFT 10% move RIGHT START reward +1 at [4,3], -1 at [4,2] reward for each step states actions rewards Policies?

32 Reminder 2 State? Actions? Rewards? Policy?

33 Policies: action = study 0.8 Facebook R= Sleep R=0 Class R= Party R=1

34 Policies: action = facebook Facebook R=-1 Sleep R=0 0.9 Class R= Party R=1

35 Fixed policy à Markov Reward process Given a policy, the MDP reduces to a Markov Reward Process P ;;< = Z a π a s P Z ;;< R _ ; = π a s R Z ; Z a Value function for policy: v _ s = E _ [G " S " = s] Action-Value function for policy: q _ s, a = E _ [G " S " = s, A " = a] How to compute the best policy?

36 The Bellman equation Policies satisfy the Bellman equation: v _ s = E _ R "=* + γ v _ (S "=* ) S " = s = R _ ; + γ D P _ ;; Vv _ (s < ) And similarly for value-action function: ; v _ s = D π a s q _ (s, a) Z a Optimal value function, and value-action function v s = max _ Important: v s = max Z v _ s q s, a = max _ q (s, a), why? q _ (s, a)

37 Theorem There exists an optimal policy π (it is deterministic!) All optimal policy achieve the same optimal value v (s) at every state, and the same optimal value-action function q s, a at every state and for every action. How can we find it? Bellman equation: v s = max Z optimality equations: q (s, a) implies Bellman q s, a = R Z ; + γ D P Z ;;< ;< max Z< q (s,a ) v s = max Z R Z ; + γ D P Z ;; Vv (s < ) ;<

38 Summary Markov Reward Process generalization of Markov Chains Markov Decision Processes formalization of learning with state from environment observations in a Markovian world. Bellman equation: fundamental recursive property of MDPs Will enable algorithms (next class )

Reinforcement Learning. Introduction

Reinforcement Learning. Introduction Reinforcement Learning Introduction Reinforcement Learning Agent interacts and learns from a stochastic environment Science of sequential decision making Many faces of reinforcement learning Optimal control