On Prediction and Planning in Partially Observable Markov Decision Processes with Large Observation Sets

1 On Prediction and Planning in Partially Observable Markov Decision Processes with Large Observation Sets. Pablo Samuel Castro, McGill University. Joint work with Doina Precup and Prakash Panangaden.

2 Motivation: We are interested in sequential decision making under uncertainty, where the agent must infer its state from observations of the environment. A larger observation space gives more information, but it also increases the complexity of the problem. Hardware is cheap and small, so agents end up with many sensors and observations!

3 Our contribution: Allow subsets of the observation space to be specified for planning/learning, and provide theoretical foundations for planning/learning with this idea. We will address questions such as: How is the agent's behaviour affected by using only a subset of all observations? How are the agent's predictions affected?

4 Outline: POMDP review, new POMDP formulation, equivalence relations, value functions, trajectory predictions, bisimulation, conclusions and future work.

5 Outline: POMDP review, new POMDP formulation, equivalence relations, value functions, trajectory predictions, bisimulation, conclusions and future work.

6 Partially observable MDPs (POMDPs): maintain a distribution over states based on clues from the environment.

7 Standard POMDPs: a 6-tuple (S, A, P, R, Ω, O) consisting of a set of states S (s, s', t, ...), a set of actions A (a, b, ...), a probabilistic transition function P(s, a)(s'), a bounded reward function R(s, a), a set of observations Ω, and an observation function O(a, s')(ω), together with a discount factor 0 ≤ γ < 1. [Figure: a grid world whose cells carry rewards (+1, -1, 0) and observation labels ω1, ω2, ω3.]
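To make the later slides easier to follow, here is a minimal sketch of this tuple as a Python container; the field names and array layout are illustrative assumptions, not notation from the talk.

import numpy as np
from dataclasses import dataclass

@dataclass
class POMDP:
    P: np.ndarray       # transition function, shape (|A|, |S|, |S|): P[a, s, s']
    R: np.ndarray       # bounded reward function, shape (|S|, |A|)
    O: np.ndarray       # observation function, shape (|A|, |S|, |Omega|): O[a, s', w]
    gamma: float        # discount factor, 0 <= gamma < 1

    @property
    def num_states(self): return self.P.shape[1]
    @property
    def num_actions(self): return self.P.shape[0]
    @property
    def num_obs(self): return self.O.shape[2]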

8 Belief states. [Figure: the agent executes a "move forward" action and its distribution over states is updated.]

9 Belief states: A belief state µ is a distribution over S. Given µ, an action a, and an observation ω, there is a unique next belief state τ(µ, a, ω). We can also compute the probability of the next observation.
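A minimal sketch of this belief update and next-observation probability, written against the hypothetical POMDP container above; the update is the standard Bayes filter the slide refers to.

def belief_update(pomdp, mu, a, w):
    """Return (tau(mu, a, w), Pr(w | mu, a)) for belief mu, action a, observation w."""
    predicted = mu @ pomdp.P[a]            # state distribution after taking a from mu
    unnorm = predicted * pomdp.O[a, :, w]  # weight each successor by Pr(w | a, s')
    prob_w = unnorm.sum()                  # Pr(w | mu, a)
    if prob_w == 0.0:
        return None, 0.0                   # observation w is impossible under (mu, a)
    return unnorm / prob_w, prob_w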

10 Outline: POMDP review, new POMDP formulation, equivalence relations, value functions, trajectory predictions, bisimulation, conclusions and future work.

11 POMDPs (new formulation): a 5-tuple (S, A, P, Ω, O) consisting of a set of states S (s, s', t, ...), a set of actions A (a, b, ...), a probabilistic transition function P(s, a)(s'), a set of observations Ω = Ω_1 × Ω_2 × ... × Ω_k (each observation is a k-dimensional vector), and an observation function O(a, s')(ω), together with a discount factor 0 ≤ γ < 1. Rewards are part of the observation vector! [Figure: the same grid world, with rewards folded into the observation labels.]

12 Partially observable MDPs (POMDPs). [Figure: state updates and performance.]

13 Specifying data and interest: Let D ⊆ {1, 2, ..., k} be the indices of the observation coordinates used for belief updates. Let I ⊆ {1, 2, ..., k} be the indices of the observation coordinates that are observables of interest for planning/prediction. Let Ω_D be the set of observations restricted to the coordinates in D, and similarly for Ω_I.

14 New POMDP dynamics: We project the observation function with binary projection matrices: O_D = O Φ_D. The unique next belief is specific to the choice of D: τ_D(µ, a, ω) = µ Pᵃ O_D^ω / (µ Pᵃ O_D^ω e^T). We can also define a transition function between belief states, T_D(µ, a)(µ').
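A sketch of the projected dynamics under the same assumed layout; Phi_D is assumed to be a binary |Ω| x |Ω_D| matrix in which each full observation maps to exactly one element of Ω_D.

def project_observation_fn(O, Phi_D):
    """O_D = O . Phi_D: aggregate the full observation function onto Omega_D."""
    # O: shape (|A|, |S|, |Omega|);  Phi_D: binary, shape (|Omega|, |Omega_D|)
    return O @ Phi_D                       # shape (|A|, |S|, |Omega_D|)

def belief_update_D(pomdp, Phi_D, mu, a, w_D):
    """tau_D(mu, a, w_D) and Pr(w_D | mu, a), using only the coordinates in D."""
    O_D = project_observation_fn(pomdp.O, Phi_D)
    predicted = mu @ pomdp.P[a]
    unnorm = predicted * O_D[a, :, w_D]
    prob = unnorm.sum()
    return (unnorm / prob if prob > 0 else None), prob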

15 Measuring performance: Elements of Ω_I may be of many different types, so we need a way to quantify an agent's performance. We assume a function f : Ω_I → R that maps observations of interest to a real number.

16 Policies and value functions: Closed-loop policies π ∈ Π map belief states to actions. The value of a belief state µ under π (intuitively, E_π[Σ_{i=0}^H γ^i r_i | µ]) is
V^π_{D,I}(µ) = Σ_{ω_I ∈ Ω_I} Pr(ω_I | µ, π(µ)) f(ω_I) + γ Σ_{µ' ∈ B} T_D(µ, π(µ))(µ') V^π_{D,I}(µ').
The optimal value function is
V*_{D,I}(µ) = max_{a ∈ A} [ Σ_{ω_I ∈ Ω_I} Pr(ω_I | µ, a) f(ω_I) + γ Σ_{µ' ∈ B} T_D(µ, a)(µ') V*_{D,I}(µ') ].
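To make the backup concrete, here is a rough sketch that applies the optimal-value recursion above over a fixed, finite set of belief points with nearest-neighbour lookup; this grid approximation is only an illustration of the equation, not the solution method used in the talk, and f is assumed to map each index of Ω_I to a real number.

def value_iteration_DI(pomdp, Phi_D, Phi_I, f, beliefs, iters=100):
    """Approximate V*_{D,I} on a fixed list of belief points (crude illustration)."""
    O_D, O_I = pomdp.O @ Phi_D, pomdp.O @ Phi_I
    nearest = lambda mu: int(np.argmin([np.abs(mu - b).sum() for b in beliefs]))
    V = np.zeros(len(beliefs))
    for _ in range(iters):
        V_new = np.zeros_like(V)
        for i, mu in enumerate(beliefs):
            q_values = []
            for a in range(pomdp.num_actions):
                predicted = mu @ pomdp.P[a]
                # Immediate term: expected f over observations of interest.
                immediate = sum((predicted * O_I[a, :, wI]).sum() * f(wI)
                                for wI in range(O_I.shape[2]))
                # Future term: expectation over next beliefs updated with D only.
                future = 0.0
                for wD in range(O_D.shape[2]):
                    unnorm = predicted * O_D[a, :, wD]
                    p = unnorm.sum()
                    if p > 0:
                        future += p * V[nearest(unnorm / p)]
                q_values.append(immediate + pomdp.gamma * future)
            V_new[i] = max(q_values)
        V = V_new
    return V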

17 Is the new POMDP definition suitable? Are fully observable MDPs still expressible? Does the definition follow intuition (e.g., do larger observation subsets yield improved performance)?

18 MDPs: Assume the observation vector is just S × R, with D pointing to S and I pointing to R; fully observable MDPs are then expressible in the new formulation.

19 Optimal value functions. Proposition: Given indexing sets D_2 ⊆ D_1 and I, V*_{D_2,I} ≤ V*_{D_1,I}.

20 Convexity of beliefs. [Figure: a coarse observation ω_2 ∈ Ω_{D_2} and the set D_1(ω_2) of finer observations in Ω_{D_1} that project onto it, for D_2 ⊆ D_1.]

21 Convexity of beliefs. Theorem: Given a belief µ, an action a, indexing sets D_2 ⊆ D_1, and an observation ω_2 ∈ Ω_{D_2}, the unique next belief τ_{D_2}(µ, a, ω_2) can be expressed as a convex combination of the belief states {τ_{D_1}(µ, a, ω_1)} for ω_1 ∈ D_1(ω_2).
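A small numerical illustration of the theorem, under the same assumptions as the earlier sketches: it rebuilds τ_{D_2}(µ, a, ω_2) as a mixture of the finer beliefs τ_{D_1}(µ, a, ω_1) with weights proportional to Pr(ω_1 | µ, a); coarse_of is a hypothetical map sending each ω_1 to the ω_2 it projects onto.

def tau_D2_as_mixture(pomdp, Phi_D1, coarse_of, mu, a, w2):
    """Reconstruct tau_{D2}(mu, a, w2) as a convex combination of tau_{D1} beliefs."""
    O_D1 = pomdp.O @ Phi_D1
    predicted = mu @ pomdp.P[a]
    mixture = np.zeros_like(mu)
    total = 0.0
    for w1 in range(O_D1.shape[2]):
        if coarse_of[w1] != w2:
            continue                          # w1 does not project onto w2
        unnorm = predicted * O_D1[a, :, w1]   # Pr(w1 | mu, a) * tau_{D1}(mu, a, w1)
        mixture += unnorm
        total += unnorm.sum()
    return mixture / total                    # equals tau_{D2}(mu, a, w2)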

22 Convexity of beliefs. [Figure: τ_{D_2}(µ, a, ω_2) lies in the convex hull of the beliefs τ_{D_1}(µ, a, ω_1) for ω_1 ∈ D_1(ω_2), with D_2 ⊆ D_1.]

23 Outline: POMDP review, new POMDP formulation, equivalence relations, value functions, trajectory predictions, bisimulation, conclusions and future work.

24 Equivalence relations: Partition the belief space into equivalence classes that capture some form of behavioural equivalence; two beliefs in the same equivalence class are behaviourally indistinguishable.

25 Outline: POMDP review, new POMDP formulation, equivalence relations, value functions, trajectory predictions, bisimulation, conclusions and future work.

26 Value function equivalences: For belief states µ, ν, let Π_{µ,ν} be the set of all policies π ∈ Π with π(µ) = π(ν). Belief states µ, ν are (D, I)-closed value equivalent if for all π ∈ Π_{µ,ν}, V^π_{D,I}(µ) = V^π_{D,I}(ν). Belief states µ, ν are (D, I)-optimal value equivalent if V*_{D,I}(µ) = V*_{D,I}(ν).

27 Closed and optimal value equivalences. Theorem: If two states are closed value equivalent, then they are necessarily optimal value equivalent (V^π equivalence implies V* equivalence).

28 Closed and optimal value equivalences (V^π equivalence implies V* equivalence). Lemma: If s_0, t_0 are V^π equivalent and V*(s_0) > V*(t_0), then the probability of reaching t_0 from s_0 under π is strictly positive. Let Π_CV be the set of all policies π constructed from some optimal policy π* as follows: π(s') = π*(s_0) if s' = t_0, and π(s') = π*(s') otherwise.

29 Closed and optimal value equivalences (V^π equivalence implies V* equivalence). Let B be the set of bounded functions V : S × Π_CV → [0, 1]. Define R ∈ B by R(s, π) = R(s, π(s)), and Υ : B → B by Υ(V)(s, π) = γ [ Σ_{s' ≠ t_0} P(s, π(s))(s') V(s', π) + P(s, π(s))(t_0) V(s_0, π) ]. Then τ(e) = R + Υ(e) has a least fixed point e*(s, π) = V^π(s). Theorem (based on Kozen, 2007): Define ϕ as ∀π ∈ Π_CV, V(s, π) ≥ V*(s). If e ⊨ ϕ, and e ⊨ ϕ implies τ(e) ⊨ ϕ, then e* ⊨ ϕ.

30 Closed and optimal value equivalences (V^π equivalence implies V* equivalence). Lemma: If s_0 and t_0 are V^π equivalent then V*(s_0) ≤ V*(t_0). Proof idea: Assume V*(s_0) > V*(t_0); one can then construct π ∈ Π_CV with V*(s_0) < V^π(s_0), a contradiction (detailed proof on the backup slide at the end of the deck). [Figure: a trajectory from s_0 that reaches t_0.]

31 Closed and optimal value equivalences (V^π equivalence implies V* equivalence). Lemma: If s_0 and t_0 are V^π equivalent then V*(s_0) ≤ V*(t_0). Lemma: If s_0 and t_0 are V^π equivalent then V*(s_0) ≥ V*(t_0).

32 Closed and optimal value equivalences. Lemma: If s_0 and t_0 are V^π equivalent then V*(s_0) ≤ V*(t_0). Lemma: If s_0 and t_0 are V^π equivalent then V*(s_0) ≥ V*(t_0). Theorem: If s_0 and t_0 are V^π equivalent then V*(s_0) = V*(t_0) (V^π equivalence implies V* equivalence).

33 Outline: POMDP review, new POMDP formulation, equivalence relations, value functions, trajectory predictions, bisimulation, conclusions and future work.

34 Trajectory equivalences. [Figure: trajectories generated from a state s under a policy π, in contrast with the scalar value V^π(s).]

35 Trajectory equivalence: Two belief states µ, ν are I-closed trajectory equivalent if for all π ∈ Π_{µ,ν} and all finite observation trajectories α = ω_1, ω_2, ..., ω_n with each ω_i ∈ Ω_I, Pr(α | µ, π) = Pr(α | ν, π).

36 Trajectory equivalence: Open-loop policies θ ∈ Θ map time steps to actions. Two belief states µ, ν are I-open trajectory equivalent if for all θ ∈ Θ and all finite observation trajectories α = ω_1, ω_2, ..., ω_n with each ω_i ∈ Ω_I, Pr(α | µ, θ) = Pr(α | ν, θ). A trajectory α together with an open-loop policy θ constitutes a PSR test (Littman et al., 2002): a_1, ω_1, a_2, ω_2, ..., a_n, ω_n.
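A sketch of computing such a test probability Pr(α | µ, θ) with a forward pass that marginalizes out every observation coordinate except those of interest; the projection Φ_I and array layout are the same illustrative assumptions as above.

def open_trajectory_prob(pomdp, Phi_I, mu, theta, alpha):
    """Pr(alpha | mu, theta): probability of the I-observation trajectory alpha
    when the open-loop action sequence theta is executed from belief mu."""
    O_I = pomdp.O @ Phi_I
    weight = mu.copy()         # unnormalised belief; its total mass is the test probability
    for a, wI in zip(theta, alpha):
        weight = (weight @ pomdp.P[a]) * O_I[a, :, wI]
    return weight.sum()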

37 Hierarchy: If D ⊇ I, then the following hierarchy is obtained: I-open trajectory equivalence implies I-closed trajectory equivalence, which implies (D, I)-closed value equivalence, which implies (D, I)-optimal value equivalence.

38 Hierarchy: If D ⊇ I, then the following hierarchy is obtained: I-open trajectory equivalence implies I-closed trajectory equivalence, which implies (D, I)-closed value equivalence, which implies (D, I)-optimal value equivalence.

39 Hierarchy: I-open trajectory equivalence does not imply (D, I)-optimal value equivalence.

40 Hierarchy: I-open trajectory equivalence does not imply (D, I)-optimal value equivalence. [Figure: a counterexample in which s and t are open trajectory equivalent, yet V*(s) = 0.6/(1 - γ) while V*(t) = 0.6γ/(1 - γ), so s and t are not optimal value equivalent.]

41 Outline: POMDP review, new POMDP formulation, equivalence relations, value functions, trajectory predictions, bisimulation, conclusions and future work.

42 Bisimulation. [Figure: two states with matching observation probabilities over ω_1, ω_2, ω_3 and matching transition behaviour, illustrating the bisimulation conditions.]

43 Bisimulation: An equivalence relation E is a (D, I)-bisimulation relation if whenever µ E ν: for all ω ∈ Ω_I and a ∈ A, Pr(ω | µ, a) = Pr(ω | ν, a); and for all c ∈ B/E and a ∈ A, Σ_{µ' ∈ c} T_D(µ, a)(µ') = Σ_{µ' ∈ c} T_D(ν, a)(µ'). If µ and ν are related by some (D, I)-bisimulation relation, we say they are (D, I)-bisimilar.
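A sketch of checking these two conditions for a candidate partition of a finite set of beliefs; it assumes the belief set is closed under the τ_D updates so that T_D can be read off by matching updated beliefs against the list, which is an illustrative shortcut rather than the talk's formal construction.

def is_DI_bisimulation(pomdp, Phi_D, Phi_I, beliefs, classes, tol=1e-8):
    """classes[i] is the equivalence-class index of beliefs[i]."""
    O_D, O_I = pomdp.O @ Phi_D, pomdp.O @ Phi_I

    def obs_probs(mu, a):                  # Pr(. | mu, a) over Omega_I
        return (mu @ pomdp.P[a]) @ O_I[a]

    def class_mass(mu, a):                 # mass that T_D(mu, a) assigns to each class
        mass = np.zeros(max(classes) + 1)
        predicted = mu @ pomdp.P[a]
        for wD in range(O_D.shape[2]):
            unnorm = predicted * O_D[a, :, wD]
            p = unnorm.sum()
            if p > 0:
                nxt = unnorm / p           # assumed to appear in `beliefs`
                j = next(k for k, b in enumerate(beliefs) if np.allclose(b, nxt, atol=tol))
                mass[classes[j]] += p
        return mass

    for i, mu in enumerate(beliefs):
        for j, nu in enumerate(beliefs):
            if classes[i] != classes[j]:
                continue
            for a in range(pomdp.num_actions):
                if not np.allclose(obs_probs(mu, a), obs_probs(nu, a), atol=tol):
                    return False
                if not np.allclose(class_mass(mu, a), class_mass(nu, a), atol=tol):
                    return False
    return True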

44 Deterministic bisimulation: An equivalence relation E is a deterministic (D, I)-bisimulation relation if whenever µ E ν: for all ω ∈ Ω_I and a ∈ A, Pr(ω | µ, a) = Pr(ω | ν, a); and for all ω ∈ Ω_D and a ∈ A, τ_D(µ, a, ω) E τ_D(ν, a, ω). If µ and ν are related by some deterministic (D, I)-bisimulation relation, we say they are deterministic (D, I)-bisimilar.

45 Hierarchy (D ⊇ I): deterministic (D, I)-bisimulation coincides with I-open trajectory equivalence, which implies I-closed trajectory equivalence, which implies (D, I)-bisimulation, which implies (D, I)-closed value equivalence, which implies (D, I)-optimal value equivalence.

46 Hierarchy: deterministic (D, I)-bisimulation and (D, I)-bisimulation do not coincide in general (counterexample on the next slide).

47 Hierarchy: (D, I)-bisimulation versus deterministic (D, I)-bisimulation. [Figure: a small example with states s, s', t, t' and observations ω_1, ω_2, ω_3, ω_4 separating the two notions.]

48 Hierarchy (D ⊇ I): deterministic (D, I)-bisimulation coincides with I-open trajectory equivalence, which implies I-closed trajectory equivalence, which implies (D, I)-bisimulation, which implies (D, I)-closed value equivalence, which implies (D, I)-optimal value equivalence.

49 Hierarchy. [Diagram: the relationships among deterministic (D, I)-bisimulation, (D, I)-bisimulation, (D, I)-closed value equivalence, I-open trajectory equivalence, and (D, I)-optimal value equivalence.]

50 Hierarchy: deterministic (D, I)-bisimulation and (D, I)-bisimulation do not coincide in general (counterexample on the next slide).

51 Hierarchy: deterministic (D, I)-bisimulation versus (D, I)-bisimulation. [Figure: a counterexample where Ω is the product of a two-valued coordinate of interest and {ω_1, ω_2, ω_3, ω_4}, with I the first coordinate and D = {ω_1, ω_2, ω_3, ω_4}; the states s and t emit the same observation of interest with probability 1, yet τ_D(s, ω_1) and τ_D(t, ω_1), and likewise τ_D(s, ω_2) and τ_D(t, ω_2), are not equivalent, so the deterministic condition fails.]

52 Hierarchy. [Diagram: the relationships among deterministic (D, I)-bisimulation, (D, I)-bisimulation, (D, I)-closed value equivalence, I-open trajectory equivalence, and (D, I)-optimal value equivalence.]

53 Strengthening open trajectory equivalence: when D ⊇ I, open trajectory equivalence can be strengthened so that it coincides with (D, I)-bisimulation, building on (Castro et al., 2009).

54 Conclusions: Observation subsets must be chosen with care to avoid suboptimal performance. Open trajectory equivalence is closely related to PSRs; we showed it is not appropriate under bad choices of D and I. In most situations we would require D ⊇ I. (D, I)-bisimulation is robust even when D does not contain I.

55 Conclusions (D ⊇ I): deterministic (D, I)-bisimulation coincides with I-open trajectory equivalence, which implies I-closed trajectory equivalence, which implies (D, I)-bisimulation, which implies (D, I)-closed value equivalence, which implies (D, I)-optimal value equivalence.

56 Conclusions. [Diagram: summary of which implications among deterministic (D, I)-bisimulation, (D, I)-bisimulation, (D, I)-closed value, I-open trajectory, and (D, I)-optimal value equivalence hold, and which fail, depending on the choice of D and I.]

57 Current work: We are currently working on learning algorithms for determining D, assuming I is known. One approach is to start with a small D and incrementally add more observations; another is to start planning/learning with a small D and use an expert/oracle to determine whether more observations are necessary.

58 Future work: We project Ω onto Ω_D and Ω_I using binary projection matrices; if we allow general projection matrices, does open trajectory equivalence yield something similar to TPSRs (Rosencrantz & Gordon, 2004; Boots et al., 2010)? Life-long learning: many tasks to solve, with different choices of D and I depending on the task. A ranking of observations could be used to set D dynamically based on time requirements.

59 References:
Kozen, D. (2007). Coinductive proof principles for stochastic processes. Logical Methods in Computer Science 3(4:8).
Littman, M., R. Sutton, and S. Singh (2002). Predictive representations of state. In Proceedings of the 14th Conference on Advances in Neural Information Processing Systems (NIPS-02).
Castro, P. S., P. Panangaden, and D. Precup (2009). Notions of state equivalence under partial observability. In Proceedings of the 21st International Joint Conference on Artificial Intelligence (IJCAI-09).
Boots, B., S. M. Siddiqi, and G. J. Gordon (2010). Closing the learning-planning loop with predictive state representations. In Proc. Robotics: Science and Systems VI.
Rosencrantz, M. and G. Gordon (2004). Learning low dimensional predictive representations. In Proceedings of the International Conference on Machine Learning (ICML-04).

60 Closed and optimal value equivalences (V^π equivalence implies V* equivalence): detailed proof. Lemma: If s_0 and t_0 are V^π equivalent then V*(s_0) ≤ V*(t_0). Proof: Assume V*(s_0) > V*(t_0). Is there a V with V(s, π) ≥ V*(s)? Yes: just take V ≡ 1. Given V(s, π) ≥ V*(s), is τ(V)(s, π) ≥ V*(s)? Take any s ≠ t_0 and π ∈ Π_CV, where π(s') = π*(s_0) if s' = t_0 and π(s') = π*(s') otherwise; then
τ(V)(s, π) = R(s, π) + Υ(V)(s, π)
= R(s, π(s)) + γ Σ_{s' ≠ t_0} P(s, π(s))(s') V(s', π) + γ P(s, π(s))(t_0) V(s_0, π)
= R(s, π*(s)) + γ Σ_{s' ≠ t_0} P(s, π*(s))(s') V(s', π) + γ P(s, π*(s))(t_0) V(s_0, π)
≥ V*(s) by the induction hypothesis. Yes.
We have therefore shown that for any s ≠ t_0 and π ∈ Π_CV, V^π(s) ≥ V*(s), with strict inequality if P(s, π*(s))(t_0) > 0, since
R(s, π*(s)) + γ Σ_{s' ≠ t_0} P(s, π*(s))(s') V*(s') + γ P(s, π*(s))(t_0) V*(s_0)
> R(s, π*(s)) + γ Σ_{s' ≠ t_0} P(s, π*(s))(s') V*(s') + γ P(s, π*(s))(t_0) V*(t_0) = V*(s).
By the last corollary we know there is a state s' ≠ t_0 with P(s', π*(s'))(t_0) > 0, and thus V^π(s') > V*(s'), contradicting the optimality of V*. By contradiction, V*(s_0) ≤ V*(t_0). Q.E.D.

61 Specifying data and interest: Let Φ_D be the |Ω| × |Ω_D| binary projection matrix used to compute O_D: O_D = O Φ_D. If D_2 ⊆ D_1, the projections compose: Φ_{D_2} = Φ_{D_1} Φ_{12}, and hence O_{D_2} = O Φ_{D_1} Φ_{12}.
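A tiny sketch of this factorization under the assumed layout: projecting directly with the composed matrix and projecting to D_1 first, then aggregating with a hypothetical Φ_{12}, give the same O_{D_2} (by associativity of the matrix products).

def compose_projections(O, Phi_D1, Phi_12):
    """Illustrate O_{D2} = O . Phi_{D1} . Phi_{12} against the direct projection."""
    Phi_D2 = Phi_D1 @ Phi_12               # composed binary projection
    O_D2_direct = O @ Phi_D2
    O_D2_factored = (O @ Phi_D1) @ Phi_12  # project to D1 first, then aggregate
    assert np.allclose(O_D2_direct, O_D2_factored)
    return O_D2_direct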

62 Approximating bisimulation. Proposition: Given D and I, beliefs µ, ν may be (D, I_i)-bisimilar for all I_i ⊊ I, but fail to be (D, I)-bisimilar. [Figure: a counterexample with I_1 and I_2 each covering one observation coordinate, I = I_1 ∪ I_2, and states s, t whose observation probabilities (1/3, 1/3, 1/6, 1/6 versus 1/2, 1/2) agree on each coordinate separately but not jointly.]

63 Partially observable MDPs (POMDPs): if the state is fully observable, the model is an MDP.

64 Partially observable MDPs (POMDPs): in a POMDP we only receive clues about the state.

65 Partially observable MDPs (POMDPs): maintain a distribution over states based on these clues.
