Open Problem: Approximate Planning of POMDPs in the class of Memoryless Policies


Kamyar Azizzadenesheli, U.C. Irvine. Joint work with Prof. Anima Anandkumar and Dr. Alessandro Lazaric.

Motivation
Agent-Environment Interaction.
Multi-Armed Bandit
Markov Decision Process (MDP)
Full information
Planning: policy $\pi$
Kamyar Azizzadenesheli (UCI) POMDP 1 / 1

Multi-Armed Bandit
Single state; the best action is the one with the highest expected reward.
[Figure: an agent choosing among actions $a_1, a_2, \ldots, a_k$ with rewards $r_1, r_2, \ldots, r_k$.]
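As a minimal sketch of the bandit rule above, assuming the expected reward of every arm is already known, the greedy choice is a single argmax. The function name `best_arm` and the reward values are hypothetical, chosen only for illustration.

```python
def best_arm(expected_rewards):
    """Return the index of the arm with the highest expected reward."""
    return max(range(len(expected_rewards)), key=lambda k: expected_rewards[k])


# Hypothetical expected rewards for three arms a_1, a_2, a_3.
expected_rewards = [0.2, 0.7, 0.5]
greedy_arm = best_arm(expected_rewards)
print(greedy_arm)  # prints 1 (the second arm)
```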

Markov Decision Process (MDP)
Fully observable environment: $y = x$.
Markovian assumption: $P(y_{t+1} \mid a_t, y_t)$, $P(r_t \mid a_t, y_t)$.
[Figure: chain $y_i \to y_{i+1} \to y_{i+2}$ with actions $a_i, a_{i+1}$ and rewards $r_i, r_{i+1}$.]

Markov Decision Process (MDP): discounted reward
Discounted reward $:= \max_\pi \sum_t \lambda^t r_t$.
Bellman equation ($0 \le \lambda < 1$):
\[ Q(a,x) = E[r(x,a)] + \lambda \sum_{x'} P(x' \mid a,x) \max_{a'} Q(a',x') \]
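The Bellman equation for the discounted reward can be solved by repeated substitution, since the update is a contraction for $0 \le \lambda < 1$. The following is a minimal sketch for a small tabular MDP, not the speaker's implementation. The two-state MDP, the function name `q_value_iteration`, and the index order `Q[x][a]` (the slide writes $Q(a,x)$) are assumptions made for illustration.

```python
def q_value_iteration(P, r, lam, n_iters=500):
    """Iterate Q(a,x) = E[r(x,a)] + lam * sum_x' P(x'|a,x) * max_a' Q(a',x').

    P[x][a][x2] is the transition probability, r[x][a] the expected reward.
    The result is stored as Q[x][a].
    """
    n_states, n_actions = len(P), len(P[0])
    Q = [[0.0] * n_actions for _ in range(n_states)]
    for _ in range(n_iters):
        V = [max(row) for row in Q]  # max over a' of Q(a', x')
        Q = [[r[x][a] + lam * sum(P[x][a][x2] * V[x2] for x2 in range(n_states))
              for a in range(n_actions)]
             for x in range(n_states)]
    return Q


# Hypothetical MDP: state 1 pays reward 1 forever; from state 0,
# action 0 stays in state 0 and action 1 moves to state 1.
P = [[[1.0, 0.0], [0.0, 1.0]],
     [[0.0, 1.0], [0.0, 1.0]]]
r = [[0.0, 0.0],
     [1.0, 1.0]]
Q = q_value_iteration(P, r, lam=0.5)
```

With $\lambda = 0.5$, state 1 is worth $1/(1-\lambda) = 2$, so moving there from state 0 is worth $\lambda \cdot 2 = 1$, which is what the iteration converges to.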

Markov Decision Process (MDP): long-term average reward
Long-term average reward $:= \max_\pi \lim_{T \to \infty} \frac{1}{T} \sum_{t=1}^{T} r_t$.
Poisson equation:
1) $\pi$
2) $R_\pi(x)$, $P_\pi(x' \mid x)$, $\eta_\pi$
3) $(I - P_\pi)V + \eta_\pi e = R_\pi$, with $e$ the all-ones vector and $V :=$ performance potential
4) Find $\pi'$ with $P_{\pi'} V + R_{\pi'} \ge P_\pi V + R_\pi$
5) $\pi \leftarrow \pi'$
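The five steps can be sketched as a policy-iteration loop. This is an illustration under stated assumptions, not the speaker's code. It assumes a deterministic policy, a chain with a single recurrent class under every policy, and the normalization $V(x_0) = 0$ so the Poisson equation has a unique solution. The helper names `solve_linear` and `average_reward_policy_iteration` and the example MDP are hypothetical.

```python
def solve_linear(A, b):
    """Solve A u = b by Gauss-Jordan elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [b_i] for row, b_i in zip(A, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda i: abs(M[i][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for i in range(n):
            if i != col:
                f = M[i][col] / M[col][col]
                M[i] = [mi - f * mc for mi, mc in zip(M[i], M[col])]
    return [M[i][n] / M[i][i] for i in range(n)]


def average_reward_policy_iteration(P, r, pi):
    """P[x][a][x2]: transitions, r[x][a]: rewards, pi[x]: action index (step 1)."""
    n, n_actions = len(P), len(P[0])
    while True:
        # Step 2: reward vector and transition matrix induced by pi.
        P_pi = [P[x][pi[x]] for x in range(n)]
        R_pi = [r[x][pi[x]] for x in range(n)]
        # Step 3: solve (I - P_pi) V + eta e = R_pi with V[0] = 0.
        # Unknown vector is [eta, V[1], ..., V[n-1]].
        A = [[1.0] + [(1.0 if i == j else 0.0) - P_pi[i][j] for j in range(1, n)]
             for i in range(n)]
        sol = solve_linear(A, R_pi)
        eta, V = sol[0], [0.0] + sol[1:]
        # Step 4: greedy improvement on R + P V; keep the old action on ties.
        new_pi = []
        for x in range(n):
            scores = [r[x][a] + sum(P[x][a][x2] * V[x2] for x2 in range(n))
                      for a in range(n_actions)]
            best = max(range(n_actions), key=lambda a: scores[a])
            new_pi.append(best if scores[best] > scores[pi[x]] + 1e-12 else pi[x])
        if new_pi == pi:
            return pi, eta, V
        pi = new_pi  # Step 5


# Hypothetical MDP: only state 1 pays reward 1; action 0 mostly stays,
# action 1 mostly switches state.
P = [[[0.9, 0.1], [0.1, 0.9]],
     [[0.1, 0.9], [0.9, 0.1]]]
r = [[0.0, 0.0],
     [1.0, 1.0]]
policy, eta, V = average_reward_policy_iteration(P, r, pi=[0, 0])
```

In this example the loop ends with the policy that moves toward state 1 and then stays there, with an average reward of 0.9.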

Partial Observability
Transition probability $P(x_{t+1} \mid a_t, x_t)$, observation distribution $P(y_t \mid x_t)$.
[Figure: hidden chain $x_i \to x_{i+1} \to x_{i+2}$ with observations $y_i, y_{i+1}$, actions $a_i, a_{i+1}$ and rewards $r_i, r_{i+1}$.]
Efficient learning algorithm by tensor methods.
The learning part is solved; the remaining part is the planning.

Distribution over states (belief) $[b(x_1), \ldots, b(x_k)]$.
Apply action $a$ and observe $y$:
\[ b'(x') = \frac{P(y \mid x') \sum_x P(x' \mid x,a)\, b(x)}{P(y \mid b,a)} \]
Bellman equation ($0 \le \lambda < 1$):
\[ Q(a,b) = E[r(b,a)] + \lambda \sum_{b'} P(b' \mid a,b) \max_{a'} Q(a',b') \]
The number of belief points grows as $O((YA)^T)$.
PSPACE-complete.
Point-based VI.
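The belief update above is a Bayes filter step. A minimal sketch follows, assuming finite state and observation sets stored as nested lists. The function name `belief_update`, the array layouts `T[a][x][x2]` and `O[x2][y]`, and the numbers are hypothetical.

```python
def belief_update(b, a, y, T, O):
    """Return b'(x') = P(y|x') * sum_x P(x'|x,a) b(x) / P(y|b,a).

    T[a][x][x2] = P(x2 | x, a) and O[x2][y] = P(y | x2).
    """
    n = len(b)
    unnormalized = [O[x2][y] * sum(T[a][x][x2] * b[x] for x in range(n))
                    for x2 in range(n)]
    p_y = sum(unnormalized)  # this is P(y | b, a)
    if p_y == 0.0:
        raise ValueError("observation has zero probability under this belief")
    return [u / p_y for u in unnormalized]


# Hypothetical model: one action that leaves the state unchanged,
# and a noisy binary observation of the state.
T = [[[1.0, 0.0], [0.0, 1.0]]]
O = [[0.8, 0.2],
     [0.3, 0.7]]
b_next = belief_update([0.5, 0.5], a=0, y=0, T=T, O=O)
```

Starting from a uniform belief, observing $y = 0$ shifts the mass toward the first state, giving $b'(x_1) = 0.4 / 0.55$.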

Memoryless Policy, Limited-Memory Policy
$\pi(a_t \mid y_t, y_{t-1}, \ldots, y_{t-n+1})$: order-$n$ policy.
In some POMDP settings the optimal policy is an order-$n$ policy.
Advantages: memory efficient, no need to solve a PSPACE-complete problem.
Optimal deterministic memoryless policy: [Yanjie Li], [John Loch].
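To make the order-$n$ definition concrete, the sketch below stores only the last $n$ observations and looks up an action distribution for that window. It assumes a finite observation set and a lookup table keyed by the window ordered from oldest to newest. The class name `OrderNPolicy`, the table, and the fallback distribution used for unseen or still-short windows are hypothetical.

```python
import random
from collections import deque


class OrderNPolicy:
    """Policy pi(a_t | y_t, ..., y_{t-n+1}) backed by a lookup table."""

    def __init__(self, n, table, default_probs, seed=0):
        self.history = deque(maxlen=n)  # keeps only the last n observations
        self.table = table              # window tuple -> action probabilities
        self.default_probs = default_probs
        self.rng = random.Random(seed)

    def act(self, y):
        self.history.append(y)
        probs = self.table.get(tuple(self.history), self.default_probs)
        return self.rng.choices(range(len(probs)), weights=probs)[0]


# Memoryless case (n = 1): the action depends on the current observation only.
memoryless = OrderNPolicy(1, {(0,): [1.0, 0.0], (1,): [0.0, 1.0]}, [0.5, 0.5])
memoryless_actions = [memoryless.act(y) for y in [0, 1, 1, 0]]

# Order-2 case: take action 1 only right after seeing observation 0 then 1.
order2 = OrderNPolicy(2, {(0, 1): [0.0, 1.0]}, [1.0, 0.0])
order2_actions = [order2.act(y) for y in [0, 1, 1]]
```

The memory cost is $n$ observations regardless of the horizon, which is the memory-efficiency advantage listed on the slide.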

The optimal memoryless policy is in general stochastic.
The Q-function is not a contractive mapping.
Optimization problem:
\[ \eta = \max_\pi \lim_{T \to \infty} \frac{1}{T} \sum_{t=1}^{T} r_t = \max_\pi \sum_x P_\pi(x)\, R_\pi(x) \]
where $P_\pi(x)$ is the stationary state distribution under $\pi$ and
\[ R_\pi(x) = \sum_a \sum_y P(y \mid x)\, \pi(a \mid y)\, r(a,x). \]
Solution??
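The objective can be evaluated for a fixed memoryless policy, which also illustrates why the optimum may be stochastic. The sketch below is an illustration, not a solution to the open problem. It assumes the stationary distribution can be reached by power iteration (an aperiodic chain with a single recurrent class). The function name `memoryless_average_reward`, the index order `r[x][a]` (the slide writes $r(a,x)$), and the two-state example with a single uninformative observation are hypothetical.

```python
def memoryless_average_reward(T, O, r, pi, n_iters=1000):
    """Return eta = sum_x P_pi(x) R_pi(x) for a memoryless policy pi(a|y).

    T[x][a][x2] = P(x2|x,a), O[x][y] = P(y|x), r[x][a], pi[y][a] = pi(a|y).
    """
    n, n_obs, n_act = len(T), len(pi), len(pi[0])
    # Action distribution seen from each hidden state: sum_y P(y|x) pi(a|y).
    act = [[sum(O[x][y] * pi[y][a] for y in range(n_obs)) for a in range(n_act)]
           for x in range(n)]
    R = [sum(act[x][a] * r[x][a] for a in range(n_act)) for x in range(n)]
    P = [[sum(act[x][a] * T[x][a][x2] for a in range(n_act)) for x2 in range(n)]
         for x in range(n)]
    d = [1.0 / n] * n  # power iteration towards the stationary distribution
    for _ in range(n_iters):
        d = [sum(d[x] * P[x][x2] for x in range(n)) for x2 in range(n)]
    return sum(d[x] * R[x] for x in range(n))


# Hypothetical aliased POMDP: both states emit the same observation.
# In state 0 action 0 pays 1 and switches state; in state 1 action 1 does.
# The wrong action pays 0 and leaves the state unchanged.
T = [[[0.0, 1.0], [1.0, 0.0]],
     [[0.0, 1.0], [1.0, 0.0]]]
O = [[1.0], [1.0]]
r = [[1.0, 0.0],
     [0.0, 1.0]]
eta_always_0 = memoryless_average_reward(T, O, r, pi=[[1.0, 0.0]])
eta_always_1 = memoryless_average_reward(T, O, r, pi=[[0.0, 1.0]])
eta_uniform = memoryless_average_reward(T, O, r, pi=[[0.5, 0.5]])
```

In this example both deterministic memoryless policies get stuck in the state where their action is wrong and earn an average reward of 0, while the uniformly random policy earns 0.5.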

Thank You!
