Seminar in Artificial Intelligence Near-Bayesian Exploration in Polynomial Time

1 Seminar in Artificial Intelligence: Near-Bayesian Exploration in Polynomial Time
Fachbereich Informatik, Knowledge Engineering Group, David Fischer

2 Table of Contents
Problem and Motivation
Algorithm
Value Function
Bayesian Exploration Bonus
Complexity
Simulated Domain
Conclusion

4 Problem and Motivation
Agent in an unknown environment with discrete states and actions.
MDP: $\{S, A, P, R, H\}$
$S$: set of states
$A$: set of actions
$P: S \times A \times S \to \mathbb{R}_{+}$: transition probabilities (unknown)
$R: S \times A \to [0, 1]$: reward function
$H$: time horizon

5 Domain Example: two-armed bandit
Lever 1: 50% chance of winning
Lever 2: 60% chance of winning

7 Value Function 1
Value of a policy $\pi$ over horizon $H$:
$V^{\pi}_{H}(s) = R(s, \pi(s)) + \sum_{s'} P(s' \mid s, \pi(s)) \, V^{\pi}_{H-1}(s')$
Bellman's equation: if the transitions of the MDP are known, we can find the optimal policy $\pi^{*}$ and the optimal value function $V^{*}$:
$V^{*}_{H}(s) = \max_{a} \left[ R(s, a) + \sum_{s'} P(s' \mid s, a) \, V^{*}_{H-1}(s') \right]$
Problem: $P$ is unknown.
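The recursion on this slide can be sketched as finite-horizon value iteration for the case where $P$ is known. This is a minimal illustration, not code from the talk: the function name, the array layout `P[s, a, s_next]`, and the choice $V_0(s) = 0$ are assumptions made for the example.

```python
import numpy as np

def finite_horizon_value_iteration(P, R, H):
    """Backward induction for a known MDP.

    P: array of shape (S, A, S), P[s, a, s_next] = P(s_next | s, a)  (assumed layout)
    R: array of shape (S, A) with rewards in [0, 1]
    H: number of remaining time steps
    Returns V_H and, for each number of steps to go, the greedy action per state.
    """
    n_states = P.shape[0]
    V = np.zeros(n_states)                      # assumption: V_0(s) = 0
    policy = np.zeros((H, n_states), dtype=int)
    for h in range(H):
        # Q[s, a] = R(s, a) + sum_{s'} P(s' | s, a) * V_h(s')
        Q = R + P @ V
        policy[h] = Q.argmax(axis=1)            # greedy action with h + 1 steps to go
        V = Q.max(axis=1)
    return V, policy
```

Row `policy[h]` holds the greedy action for each state when `h + 1` steps remain, so the action to take first is in `policy[H - 1]`.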

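8 Value Function 2
Use a belief state $b$: a set of Dirichlet distributions over the transitions.
$b = \{\alpha(s, a, s')\}$
$\alpha_0(s, a) = \sum_{s'} \alpha(s, a, s')$
$P(s' \mid b, s, a) = \frac{\alpha(s, a, s')}{\alpha_0(s, a)}$
This gives a value function that does not need the true $P$:
$V_{H}(b, s) = \max_{a} \left[ R(s, a) + \sum_{s'} P(s' \mid b, s, a) \, V_{H-1}(b', s') \right]$
where $b'$ is the belief after observing the transition $(s, a, s')$.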
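A small sketch of the belief state may help here: the Dirichlet counts are stored in one array, the expected transition probabilities are the normalized counts, and observing a transition adds one to the matching count. The function names and the array layout `alpha[s, a, s_next]` are assumptions for illustration only.

```python
import numpy as np

def posterior_mean_transitions(alpha):
    """P(s_next | b, s, a) = alpha(s, a, s_next) / alpha_0(s, a)."""
    alpha0 = alpha.sum(axis=2, keepdims=True)   # alpha_0(s, a)
    return alpha / alpha0

def update_belief(alpha, s, a, s_next):
    """Return the counts after observing one transition (s, a, s_next)."""
    new_alpha = alpha.copy()                    # leave the old belief untouched
    new_alpha[s, a, s_next] += 1.0
    return new_alpha
```

Starting from a uniform prior such as `np.ones((S, A, S))`, repeated calls to `update_belief` concentrate the estimate on the transitions actually observed.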

9 Domain Example: two-armed bandit
Lever 1: 50% chance of winning; pulled 100 times, paid off 52 times (52%)
Lever 2: 60% chance of winning; pulled 5 times, paid off 2 times (40%)

11 Bayesian Exploration Bonus (BEB)
Bonus: $\frac{\beta}{1 + \alpha_0(s, a)}$
$V_{H}(b, s) = \max_{a} \left[ R(s, a) + \frac{\beta}{1 + \alpha_0(s, a)} + \sum_{s'} P(s' \mid b, s, a) \, V_{H-1}(b, s') \right]$
The three terms are the reward, the exploration bonus, and the estimated mean value of the next states.
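The BEB recursion can be sketched as ordinary value iteration on the mean transition model with the bonus added to the reward. This is a hedged illustration rather than the presenter's implementation: the function name, the array layout, and the choice of a zero value at horizon 0 are assumptions, and the belief is held fixed during planning.

```python
import numpy as np

def beb_values(alpha, R, H, beta):
    """Value iteration with the exploration bonus beta / (1 + alpha_0(s, a)).

    alpha: Dirichlet counts of shape (S, A, S) (assumed layout alpha[s, a, s_next])
    R:     rewards of shape (S, A)
    """
    alpha0 = alpha.sum(axis=2)                  # alpha_0(s, a)
    P_mean = alpha / alpha0[:, :, None]         # P(s_next | b, s, a)
    bonus = beta / (1.0 + alpha0)
    n_states = alpha.shape[0]
    V = np.zeros(n_states)                      # assumption: value 0 at horizon 0
    greedy = np.zeros(n_states, dtype=int)
    for _ in range(H):
        Q = R + bonus + P_mean @ V              # reward + bonus + mean next value
        greedy = Q.argmax(axis=1)
        V = Q.max(axis=1)
    return V, greedy
```

In a sketch like this, the agent would act greedily, add the observed transition to `alpha`, and re-plan; the bonus shrinks as the counts for a state-action pair grow.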

12 Domain Example: two-armed bandit
Lever 1: 50% chance of winning; pulled 100 times, paid off 52 times (52%)
Lever 2: 60% chance of winning; pulled 5 times, paid off 2 times (40%)
$R_1 = 0.52 + \frac{\beta}{1 + 102}$
$R_2 = 0.4 + \frac{\beta}{1 + 7}$

13 Domain Example: two-armed bandit
$R_1 = 0.52 + \frac{\beta}{1 + 102}$, $R_2 = 0.4 + \frac{\beta}{1 + 7}$
$\beta = 0$: $R_1 = 0.52$, $R_2 = 0.4$
$\beta = 1$: $R_1 \approx 0.53$, $R_2 = 0.525$
$\beta = 2$: $R_1 \approx 0.54$, $R_2 = 0.65$
$\beta = 3$: $R_1 \approx 0.55$, $R_2 = 0.775$
$\beta = 4$: $R_1 \approx 0.56$, $R_2 = 0.9$
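The bandit numbers can be reproduced with a few lines of arithmetic. The sketch below assumes $\alpha_0 = 102$ for lever 1 (as printed on the slide) and $\alpha_0 = 7$ for lever 2, which is inferred from the slide's value $R_2 = 0.65$ at $\beta = 2$; the function name is hypothetical.

```python
def beb_score(mean_reward, alpha0, beta):
    """Estimated reward plus the exploration bonus beta / (1 + alpha0)."""
    return mean_reward + beta / (1.0 + alpha0)

for beta in range(5):
    r1 = beb_score(0.52, 102, beta)   # lever 1: 52 wins in 100 pulls
    r2 = beb_score(0.40, 7, beta)     # lever 2: 2 wins in 5 pulls (alpha0 inferred)
    choice = 1 if r1 >= r2 else 2
    print(f"beta={beta}: R1={r1:.3f} R2={r2:.3f} -> pull lever {choice}")
```

Without a bonus the agent keeps pulling lever 1; once $\beta \geq 1$, the rarely tried lever 2 scores at least as high, which is how the bonus drives exploration toward the arm that is actually better.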

15 Complexity
$\epsilon$-close to the optimal Bayesian policy
BEB: $O\!\left( \frac{|S| |A| H^{6}}{\epsilon^{2}} \log \frac{|S| |A|}{\delta} \right)$
Standard PAC-MDP: $\tilde{O}\!\left( \frac{|S|^{2} |A| H^{6}}{\epsilon^{3}} \right)$
The $\tilde{O}$ notation suppresses logarithmic factors.
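To see what the difference between the two bounds amounts to, one can divide them, ignoring constants and logarithmic factors. This is only a rough reading of the expressions on the slide, not a statement from the talk:

```latex
\frac{|S|^{2} |A| H^{6} / \epsilon^{3}}{|S| |A| H^{6} / \epsilon^{2}}
  = \frac{|S|}{\epsilon}
```

For instance, with $|S| = 5$ and $\epsilon = 0.1$ the ratio is 50. The two bounds are stated with respect to different notions of optimality, so the ratio is indicative only.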

17 Simulated Domain
Chain domain with five states and two actions. With probability 0.2 the agent performs the opposite of the intended action.
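The transition structure of such a chain can be sketched as follows. The slide does not spell out what the two actions do, so this assumes the common chain layout in which action 0 moves one state forward (staying put at the end of the chain) and action 1 returns to the first state; rewards are left out because the slide does not give them. The function name is hypothetical.

```python
import numpy as np

def chain_transitions(n_states=5, slip=0.2):
    """Transition tensor P[s, a, s_next] for a chain with slipping actions.

    Assumed layout: action 0 = advance, action 1 = return to state 0.
    With probability `slip` the opposite action's effect happens instead.
    """
    P = np.zeros((n_states, 2, n_states))
    for s in range(n_states):
        forward = min(s + 1, n_states - 1)   # last state loops onto itself
        P[s, 0, forward] += 1.0 - slip       # advance as intended
        P[s, 0, 0] += slip                   # slipped: sent back to the start
        P[s, 1, 0] += 1.0 - slip             # return as intended
        P[s, 1, forward] += slip             # slipped: moved forward instead
    return P
```

A tensor built this way has the `(S, A, S)` shape used for the true dynamics that the agent's Dirichlet belief tries to estimate.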

18 Simulated Domain: Result (figure)

19 Simulated Domain: Result (figure)

21 Conclusion
$\epsilon$-close to the optimal Bayesian policy after a polynomial number of time steps
Balanced exploration and exploitation
Better complexity than standard PAC-MDP (in polynomial time)
