Planning and Acting in Partially Observable Stochastic Domains


1 Planning and Acting in Partially Observable Stochastic Domains
Leslie Pack Kaelbling*, Michael L. Littman**, Anthony R. Cassandra***
*Computer Science Department, Brown University, Providence, RI, USA
**Department of Computer Science, Duke University, Durham, NC, USA
***Microelectronics and Computer Technology Corporation (MCC), Austin, TX, USA
Artificial Intelligence, 1998
MINSOO KANG, February 6th


2 Partially Observable Markov Decision Process: Basics

              Observable        Partially observable
No actions    Markov process    Hidden Markov model
Actions       MDP               POMDP

Given:
S: set of states / A: finite set of actions / R: reward / P: transition probability (as in a standard MDP)
Added for POMDP:
O: set of conditional observation probabilities (observation function)
Ω: set of observations (an individual observation is written o)

3 POMDP: Belief State & Value Function
Belief state: a probability distribution over the states of the underlying MDP (it satisfies the Markov property).
The slide gives the equation for moving from belief b to belief b', and the value function over beliefs.
Ex) Possible state probabilities:
b(s1,s2,s3) = (0.3, 0.4, 0.3): b(s1)=0.3, b(s2)=0.4, b(s3)=0.3
b'(s1,s2,s3) = (0.1, 0.2, 0.7): b'(s1)=0.1, b'(s2)=0.2, b'(s3)=0.7
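
For reference, the standard Bayes-rule form of the belief update can be written with the symbols from slide 2. The argument conventions P(s' | s, a) and O(o | s', a) are assumptions here, since the slide does not spell them out:

```latex
\[
b'(s') \;=\; \frac{O(o \mid s', a)\,\sum_{s \in S} P(s' \mid s, a)\, b(s)}{\Pr(o \mid a, b)},
\qquad
\Pr(o \mid a, b) \;=\; \sum_{s' \in S} O(o \mid s', a) \sum_{s \in S} P(s' \mid s, a)\, b(s).
\]
```

The denominator only normalizes b' so that it sums to 1; it does not depend on s'.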

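4 Belief State
The belief space is continuous! Even with |S| = 2, |B| = ∞.
MDP-style value iteration cannot be applied directly, since there are infinitely many states (beliefs).
In the finite-horizon setting, the optimal policy is non-stationary (time-variant): the best action depends on the number of steps remaining.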

5 Belief States: Example for Larger Dimensions

6 Value Function for Belief State

7 Value Function for Belief State
Sondik (1971): state estimator SE(b, a, o), where
P(b' | b, a, o) = 1 if SE(b, a, o) = b'
P(b' | b, a, o) = 0 otherwise
The state estimator is deterministic, so this probability is binary.
b' = (P(s1 | o, a, b), P(s2 | o, a, b), ...)
V(b') = (V(s1, a), V(s2, a), ...)
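
A minimal sketch of the state estimator SE(b, a, o) as a Bayes update may make the slide concrete. The nested-list layouts P[a][s][s2] and O[a][s2][o], and all numbers in the toy model, are hypothetical and chosen only for illustration:

```python
def belief_update(b, a, o, P, O):
    """State estimator SE(b, a, o): return the next belief b'.

    b          : list of probabilities over states
    P[a][s][s2]: assumed layout, probability of moving s -> s2 under action a
    O[a][s2][o]: assumed layout, probability of observing o after reaching s2
    """
    n = len(b)
    # Predict the next state, then weight by the likelihood of the observation.
    unnormalized = [
        O[a][s2][o] * sum(P[a][s][s2] * b[s] for s in range(n))
        for s2 in range(n)
    ]
    prob_o = sum(unnormalized)  # Pr(o | a, b)
    if prob_o == 0.0:
        raise ValueError("observation has zero probability under this belief")
    return [u / prob_o for u in unnormalized]


# Hypothetical two-state model: one action (index 0), two observations.
P = [[[0.2, 0.8], [0.8, 0.2]]]
O = [[[0.7, 0.3], [0.3, 0.7]]]
b_next = belief_update([0.5, 0.5], a=0, o=0, P=P, O=O)
print(b_next)
```

Because the function returns exactly one b' for each (b, a, o), P(b' | b, a, o) is 1 for that belief and 0 for every other, which is the binary property stated on the slide.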

8 POMDP: How to solve? (Sondik 1971, Littman 1998)
Generalized form: let P be the finite set of t-step policy trees; this set defines Vt(b).
Vt(b) can be represented geometrically as a piecewise-linear and convex value function.
The upper surface is the Vt(b) we are interested in, and each line represents the action to take in the belief states where that line is highest.


9 POMDP: How to solve? (Sondik 1971, Littman 1998)
1. Construct the one-step policy trees (just one action each): a1, a2.
Reward for taking action a1: 2 in state 0, 0 in state 1.
Reward for taking action a2: 0 in state 0, 3 in state 1.
(Horizontal axis of the plot: probability that you are in state 0.)
The value function here (not optimal) is the expected immediate reward of each action under the current belief.
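
The one-step value function for the rewards on this slide can be sketched as a maximum over two lines in the belief. The names ALPHA and one_step_value are illustrative, not from the slides:

```python
# One vector per one-step policy tree: (reward in state 0, reward in state 1).
ALPHA = {"a1": (2.0, 0.0), "a2": (0.0, 3.0)}


def one_step_value(p0):
    """Return (best action, value) for belief p0 = Pr(state 0)."""
    values = {
        action: r0 * p0 + r1 * (1.0 - p0)
        for action, (r0, r1) in ALPHA.items()
    }
    best = max(values, key=values.get)
    return best, values[best]


print(one_step_value(0.5))  # a2 is better here: 3 * 0.5 = 1.5 beats 2 * 0.5 = 1.0
print(one_step_value(0.8))  # a1 is better here: 2 * 0.8 beats 3 * 0.2
```

The two lines cross where 2p = 3(1 - p), that is at p = 0.6, so a2 is preferred below that belief and a1 above it.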

10 POMDP: How to solve? (Sondik 1971, Littman 1998)
2. Extend this to a 2-step time horizon: evaluate every possible 2-step policy tree with the value function update equation.
3. Prune the value functions that are dominated by other value functions.
(Figure: the value function given an action; the light blue lines are pruned.)
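
The pruning step can be illustrated for a two-state problem with a simple grid check: a line is kept only if it is the highest one at some sampled belief. This is an approximation chosen for brevity, not the method from the slides; exact pruning is usually done with linear programs. The third vector below is hypothetical:

```python
def prune(alphas, grid_size=1001):
    """Keep only the vectors that attain the maximum at some grid belief.

    alphas: list of (value in state 0, value in state 1) pairs.
    A vector that wins only between two grid points would be missed,
    so this is a sketch rather than an exact pruning procedure.
    """
    keep = set()
    for i in range(grid_size):
        p0 = i / (grid_size - 1)  # belief that we are in state 0
        values = [v0 * p0 + v1 * (1.0 - p0) for v0, v1 in alphas]
        keep.add(values.index(max(values)))
    return [alphas[k] for k in sorted(keep)]


# The two vectors from slide 9 plus a hypothetical dominated one.
candidates = [(2.0, 0.0), (0.0, 3.0), (0.5, 0.5)]
print(prune(candidates))
```

The vector (0.5, 0.5) is dropped because the upper surface of the other two never falls below 1.2, so it is never the best choice at any belief.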

11 Example Problem
Example from Prof. Wolfram Burgard's lecture notes (Department of Computer Science, University of Freiburg).
Given: action set, observation set, state set, reward (cost), transition probability, observation function (no discount factor).

12 Example Problem
If p1 is the probability of being in x1, then b = (p1, 1 - p1) and
r(b,a1) = -100p1 + 100(1 - p1)
r(b,a2) = 100p1 - 50(1 - p1)
r(b,a3) = -1
For the 1-step horizon, the value function is the maximum of these three expected rewards.

13 Example Problem
After pruning, the optimal policy for the 1-step horizon is:
a₁ if p₁ < 3/7
a₂ if p₁ ≥ 3/7
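
The 3/7 threshold can be checked directly from the expected rewards on slide 12. The sketch below uses exact fractions; the function names are illustrative:

```python
from fractions import Fraction


def r(p1, action):
    """Expected immediate reward for belief b = (p1, 1 - p1)."""
    if action == "a1":
        return -100 * p1 + 100 * (1 - p1)
    if action == "a2":
        return 100 * p1 - 50 * (1 - p1)
    if action == "a3":
        return -1
    raise ValueError(f"unknown action: {action}")


def best_action(p1):
    """Action with the highest expected reward (ties go to the first listed)."""
    return max(("a1", "a2", "a3"), key=lambda action: r(p1, action))


threshold = Fraction(3, 7)
print(r(threshold, "a1"), r(threshold, "a2"))  # both equal 100/7
print(best_action(Fraction(1, 4)), best_action(Fraction(3, 4)))
```

Setting 100 - 200 p1 equal to 150 p1 - 50 gives p1 = 3/7, where both actions are worth 100/7. That is above the -1 earned by a3, so a3 is never the best one-step action and its line is pruned.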

14 Example Problem
We now extend the time horizon to t=2, starting from V1 (backward induction). The computation is carried out for one observation and then repeated in the same way for o₂.

15 Example Problem
(Figure: value function with the pruned lines marked.)
The game ends when a1 or a2 is chosen, since either action terminates the game. However, choosing a3 first may be optimal, so we have to check whether it gives the optimal value. Let the first action be a3; the belief state then shifts.

16 Example Problem
Given that a3 is chosen first, V₂(b) is the maximum of the value functions at t=2 for the given belief state.

17 POMDP: Conclusion
Pruning is crucial for containing the combinatorial explosion. In the example above, the unpruned algorithm needs a number of linear functions that is a very large power of 10 by t=20, whereas only 12 are needed to represent the value function with pruning.
Research shows that POMDPs perform better than MDPs in many contexts (with small state, action, and observation sets).
However, solving a finite-horizon POMDP is PSPACE-complete, and the infinite-horizon POMDP is undecidable. (A polynomial-time algorithm for finite-horizon POMDPs would therefore imply P = PSPACE, and hence P = NP.)
Thus there are many value function approximation methods, which may be helpful, but the model remains limited to very confined research settings.
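
To give a feel for the explosion, the sketch below counts all t-step policy trees of a generic POMDP using the formula |A|^((|Ω|^t - 1) / (|Ω| - 1)) from the Kaelbling, Littman and Cassandra paper. This is a generic count and not the same tally as the figure quoted on the slide, which depends on the structure of the example; the function names are illustrative:

```python
import math


def num_tree_nodes(num_obs, horizon):
    """Number of action nodes in a policy tree of the given depth."""
    if num_obs == 1:
        return horizon
    return (num_obs ** horizon - 1) // (num_obs - 1)


def log10_policy_trees(num_actions, num_obs, horizon):
    """log10 of the number of distinct policy trees (one action per node)."""
    return num_tree_nodes(num_obs, horizon) * math.log10(num_actions)


# Three actions and two observations, as in the example problem.
for t in (1, 2, 5, 10, 20):
    print(t, num_tree_nodes(2, t), round(log10_policy_trees(3, 2, t), 1))
```

Each node of a tree picks one of |A| actions, and the number of nodes grows geometrically with the horizon, so the number of candidate linear functions grows doubly exponentially unless dominated ones are pruned at every step.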

18 Thank you
