Discrete planning (an introduction)

Size: px

Start display at page:

Download "Discrete planning (an introduction)"

Ami Shaw
5 years ago
Views:

1 Sistemi Intelligenti Corso di Laurea in Informatica, A.A Università degli Studi di Milano Discrete planning (an introduction) Nicola Basilico Dipartimento di Informatica Via Comelico 39/ Milano (MI) Ufficio S Sito per queste lezioni

2 Planning under uncertainty Action selection is often affected by uncertainty Example:

3 Planning under uncertainty Action selection is often affected by uncertainty Example:

4 Planning under uncertainty Action selection is often affected by uncertainty Example: Fast forward for 2 seconds

5 Planning under uncertainty Action selection is often affected by uncertainty Example:? Fast forward for 2 seconds

6 Planning under uncertainty Are the effects of my actions perfectly predictable? Am I always sure about what s going on? Deterministic vs Stochastic transitions Fully observable vs Partially observable states

7 Examples Deterministic transitions, fully observable states Only actuation is needed, no sensing! A move B B

8 Examples Stochastic transitions, fully observable states Surveillance robot Pos A move B Pos B Pos C Broken

9 Examples? Pos A move Pos B? Pos D move Pos C Deterministic transitions, partially observable states

10 Examples? Pos A move 0.99 Pos B? Pos D move 0.99 Pos C Broken Stochastic transitions, partially observable states

11 Markov Decision Processes (MDP) We assume full observability of states, but non-deterministic actions We cannot specify a transition function like before, instead we give a set of transition probabilities Probability of reaching state s, given that current state is s and action a is taken State transitions satisfy the Markov property: they depend only on the current state and not on states visited before

12 Example (Markovian, deterministic)

13 Example (Non-Markovian (?), deterministic) If the snake world is finite, then

14 Example (Markovian, stochastic) Pos A move B Pos B Pos C Broken Stochastic transitions, fully observable states

15 MDPs Can we formulate the problem asking for a plan? Plans are unfit for this situation: we cannot tell how to reach some goal by giving a mere sequence of actions We need a policy Given the current state, returns what action to play It s deterministic: given a state it does not randomize on which action to take It s stationary: it does not change over time These assumption are not restrictive in MDPs

16 MDPs Policy execution: 1. Observe current state s 2. Execute 3. Repeat from 1

17 MDPs We previously spoken about action costs, in MDPs we speak about immediate rewards It s a payoff the agent gets when she transitions from state s, to state s with an action a Rewards generalize in some sense the notion of goal states The objective is to find a policy that maximizes the expected reward over some time horizon H

18 MDP Value iteration Idea: let s introduce the concept of value function How does it work? An agent executing policy π is in state s: how happy is she? This quantity is defined as the expected cumulative reward that can be obtained by executing π from s

19 H=2 s 0

20 H=2 s s 0

21 H=2 0.6 s 3 s s 4 s 0

22 H=2 0.6 s 3 s s 4 s s 2

23 H=2 0.6 s 3 s s 4 s s 5 s s 6

24 H=2 0.6 s 3 s s 4 s s 5 s 2 Discount rewards more and more as they happen in the long future 0.7 s 6

25 MDP Value iteration Clearly we search for a policy π that yields the maximum value in every state We can exploit the Bellman principle of optimality to define a value iteration procedure It s the value of an optimal policy with horizon i

26 MDP Value iteration We can write: The horizon is zero, no action no reward

27 MDP Value iteration We can write: The horizon is 1, there s room for just one action. The best thing to do is selecting the actions that maximize the immediate expected reward.

28 MDP Value iteration We can write: Now the horizon is 2. The optimal policy would select the action that maximizes the immediate expected reward plus the expected discounted reward of acting optimally from the arrival state.

29 We can write: MDP Value iteration

30 Bellman s equation MDP Value iteration

31 Nicola Basilico Dipartimento di Informatica Via Comelico 39/ Milano (MI) Ufficio S Sistemi Intelligenti Corso di Laurea in Informatica, A.A Università degli Studi di Milano

Christopher Watkins and Peter Dayan. Noga Zaslavsky. The Hebrew University of Jerusalem Advanced Seminar in Deep Learning (67679) November 1, 2015

Q-Learning Christopher Watkins and Peter Dayan Noga Zaslavsky The Hebrew University of Jerusalem Advanced Seminar in Deep Learning (67679) November 1, 2015 Noga Zaslavsky Q-Learning (Watkins & Dayan, 1992)