State Space Abstraction for Reinforcement Learning
Rowan McAllister & Thang Bui, MLG, Cambridge, Nov 6th, 2014
Rowan's introduction
Types of abstraction [LWL06]
Abstractions partition the state space based on the state-action value functions $Q^\pi(s, a)$ and $Q^*(s, a)$, and are loosely categorised into five types.
Model-irrelevance abstraction $\phi_{model}$
Definition: $\phi_{model}(s_1) = \phi_{model}(s_2)$ implies, for all actions $a$ and abstract states $x$,
$$R^a_{s_1} = R^a_{s_2} \quad \text{and} \quad \sum_{s' \in \phi_{model}^{-1}(x)} P^a_{s_1 s'} = \sum_{s' \in \phi_{model}^{-1}(x)} P^a_{s_2 s'}.$$
In words, for any action a, ground states in the same abstract class should have the same reward, and have the same transition probability into each new abstract class.
[Figure: grid-world example, with panels for Rewards, Q (action-values), π (policy) and V (state-values)]
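To make the condition concrete, here is a minimal sketch (Python/numpy, with a hypothetical tabular MDP encoding not taken from the talk) that tests whether a candidate partition satisfies the $\phi_{model}$ condition:

    import numpy as np

    def is_model_irrelevant(phi, R, P, tol=1e-9):
        """Check the phi_model condition for a tabular MDP.

        phi: length-n array, phi[s] = abstract class of ground state s
        R:   rewards, shape (n_states, n_actions)
        P:   transitions, shape (n_actions, n_states, n_states)
        """
        classes = np.unique(phi)
        for x in classes:
            members = np.flatnonzero(phi == x)
            s1 = members[0]
            for s2 in members[1:]:
                # Same reward for every action: R^a_{s1} = R^a_{s2}.
                if not np.allclose(R[s1], R[s2], atol=tol):
                    return False
                # Same total transition mass into every abstract class.
                for y in classes:
                    mask = phi == y
                    if not np.allclose(P[:, s1, mask].sum(axis=1),
                                       P[:, s2, mask].sum(axis=1), atol=tol):
                        return False
        return True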
Types of abstraction [cont'd]
Definition ($Q^\pi$-irrelevance abstraction): $\phi_{Q^\pi}(s_1) = \phi_{Q^\pi}(s_2) \Rightarrow Q^\pi(s_1, a) = Q^\pi(s_2, a), \; \forall \pi, a$
In words, for any policy π and action a, ground states in the same abstract class should have the same state-action value.
[Figure: grid-world example, with panels for Rewards, Q (action-values), π (policy) and V (state-values)]
Types of abstraction [cont'd]
Definition ($Q^*$-irrelevance abstraction): $\phi_{Q^*}(s_1) = \phi_{Q^*}(s_2) \Rightarrow Q^*(s_1, a) = Q^*(s_2, a), \; \forall a$
In words, for any action a, ground states in the same abstract class should have the same optimal state-action value function, and hence the same optimal policy.
[Figure: grid-world example, with panels for Rewards, Q (action-values), π (policy) and V (state-values)]
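For a small MDP whose model is known, such a partition could be computed directly; a sketch (same hypothetical array conventions as the earlier check) that runs value iteration for $Q^*$ and then groups states with matching $Q^*$ rows:

    import numpy as np

    def q_star(R, P, gamma=0.9, iters=1000):
        """Value iteration for Q*; R is (n_states, n_actions),
        P is (n_actions, n_states, n_states)."""
        Q = np.zeros_like(R)
        for _ in range(iters):
            V = Q.max(axis=1)
            Q = R + gamma * np.einsum('ast,t->sa', P, V)
        return Q

    def phi_q_star(Q, decimals=6):
        """phi_{Q*}: group ground states whose Q* rows agree (up to rounding)."""
        labels = {}
        keys = [tuple(np.round(row, decimals)) for row in Q]
        return np.array([labels.setdefault(k, len(labels)) for k in keys])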
Types of abstraction [cont'd]
Definition ($a^*$-irrelevance abstraction): $\phi_{a^*}(s_1) = \phi_{a^*}(s_2) \Rightarrow \exists a^*$ optimal for both $s_1$ and $s_2$, with $Q^*(s_1, a^*) = \max_a Q^*(s_1, a) = \max_a Q^*(s_2, a) = Q^*(s_2, a^*)$
In words, ground states in the same abstract class should share an optimal action, and have the same optimal state-action value upon taking that optimal action.
[Figure: grid-world example, with panels for Rewards, Q (action-values), π (policy) and V (state-values)]
And even more abstract...
Definition ($\pi^*$-irrelevance abstraction): $\phi_{\pi^*}(s_1) = \phi_{\pi^*}(s_2) \Rightarrow \exists a^*$ that is optimal for both $s_1$ and $s_2$
In words, ground states in the same abstract class should share an optimal action.
[Figure: grid-world example, with panels for Rewards, Q (action-values), π (policy) and V (state-values)]
Types of abstraction - An example
[Figure: (a) an example ground MDP and (b) an abstraction of it, with transition probabilities 0.5]
Types of abstraction - A comparison
Questions: Is the optimal policy the same in the abstract space and the ground space? Are the optimal state-action value functions the same?

  Type      Same optimal policy   Q-learning optimality
  φ_model   yes                   yes
  φ_Q^π     yes                   yes
  φ_Q*      yes                   yes
  φ_a*      yes                   no
  φ_π*      no                    no

[Figure: counterexample MDP (b), with transition probabilities 0.5]
Predictive State Representation: An introduction
A Zoubin cube for HMM/POMDP/PSR...
Predictive State Representation - Overview
A model for controlled dynamical systems.
Why useful?
- A compact representation compared to k-Markov or POMDP models
- Speedy learning
A review of POMDPs
A POMDP is defined by a tuple $\langle S, A, O, T, O, b_0 \rangle$:
- $S$, $A$, $O$: sets of hidden states, actions and observations, with $|S| = n$
- $T$: set of $n \times n$ transition matrices, one for each action $a \in A$
- $O$: set of $n \times n$ diagonal observation matrices, one for each action-observation pair
- $b_0$: initial belief state, $\{p(s_i)\}_{i=1}^n$
POMDPs use a belief state to encode historical information. The belief state at history $h$ is $b(s|h) = \{p(s_i|h)\}_{i=1}^n$. After taking an action $a$ and observing $o$, the belief state can be updated:
$$b(s|hao) = \frac{b(s|h) \, T^a O^{a,o}}{b(s|h) \, T^a O^{a,o} \, 1_n}$$
PSRs remove hidden states and instead use probabilities of particular future action-observation sequences to encode historical information.
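A one-line numpy sketch of this update, assuming the belief is kept as a row vector and the matrices follow the conventions above:

    import numpy as np

    def belief_update(b, T_a, O_ao):
        """POMDP belief update after taking action a and observing o.

        b:    belief over hidden states, shape (n,)
        T_a:  n x n transition matrix for action a
        O_ao: n x n diagonal observation matrix for the pair (a, o)
        """
        unnorm = b @ T_a @ O_ao           # numerator: b(s|h) T^a O^{a,o}
        return unnorm / unnorm.sum()      # denominator: b(s|h) T^a O^{a,o} 1_n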
Predictive State Representation - Notations
A test (or experiment) $t$ is a sequence of actions and observations: $t = a_1 o_1 a_2 o_2 \cdots a_k o_k$
An observed history $h$ of length $n$: $h = a_1 o_1 a_2 o_2 \cdots a_n o_n$
Test prediction given history:
$$p(t|h) = \mathrm{Pr}(o_{n+1} = o_1, \ldots, o_{n+k} = o_k \mid h, a_{n+1} = a_1, \ldots, a_{n+k} = a_k)$$
A set of $m$ core tests $T = \{t_1, t_2, \ldots, t_m\}$ has prediction vector $p(T|h) = [p(t_1|h), p(t_2|h), \ldots, p(t_m|h)]$.
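When a POMDP model of the system is available, a test prediction can be computed by chaining the matrices from the POMDP review above; a small sketch (the dict-based model encoding is an assumption for illustration):

    import numpy as np

    def test_prediction(b_h, test, T, O):
        """p(t|h) for a test t = a1 o1 ... ak ok, from a POMDP model.

        b_h:  belief state at history h, shape (n,)
        test: sequence of (action, observation) pairs
        T:    dict mapping action -> n x n transition matrix
        O:    dict mapping (action, observation) -> n x n diagonal matrix
        """
        v = np.asarray(b_h, dtype=float)
        for a, o in test:
            v = v @ T[a] @ O[(a, o)]
        return v.sum()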
Predictive State Representation - Definition
Definition ([LSS02]): A set $T$ is a PSR iff its prediction vector forms a sufficient statistic for the dynamical system, i.e. $p(t|h) = f_t(p(T|h))$ for any test $t$ and history $h$.
The mapping $f_t(\cdot)$ could be linear or non-linear. A linear $f_t(\cdot)$ means that for every test $t$ there exists a weight vector $w_t$ such that $p(t|h) = p(T|h)^\top w_t$.
Updating $p(T|h)$ upon taking action $a$ and observing $o$:
$$p(t_i|hao) = \frac{p(a o t_i|h)}{p(a o|h)} = \frac{f_{aot_i}(p(T|h))}{f_{ao}(p(T|h))} = \frac{p(T|h)^\top w_{aot_i}}{p(T|h)^\top w_{ao}}$$
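In the linear case the update above is just two inner products; a minimal sketch, assuming the weight vectors of the extended tests are stacked into a matrix:

    import numpy as np

    def psr_update(p_T, W_aot, w_ao):
        """Update the prediction vector p(T|h) -> p(T|hao).

        p_T:   current prediction vector p(T|h), shape (m,)
        W_aot: matrix whose i-th column is w_{a o t_i}, shape (m, m)
        w_ao:  weight vector of the one-step test ao, shape (m,)
        """
        return (p_T @ W_aot) / (p_T @ w_ao)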
Predictive State Representation - Why use PSR?
Theorem ([LSS02]): For any environment that can be represented by a finite POMDP model, there exists a linear PSR whose number of tests $m$ is no larger than the number of states in the minimal POMDP model.
Example: a POMDP with 5 states admits a linear PSR with $T = \{r, fr, ffr, fffr, ffffr\}$ and a nonlinear PSR with $T = \{r, fr\}$.
Predictive State Representation - The system dynamics matrix
The system dynamics matrix $D$ specifies the prediction of every test given every history:

          t_1          ...   t_i          ...
  h_0     p(t_1|h_0)   ...   p(t_i|h_0)   ...
  h_1     p(t_1|h_1)   ...   p(t_i|h_1)   ...
  ...
  h_j     p(t_1|h_j)   ...   p(t_i|h_j)   ...

$D$ has an infinite number of columns and an infinite number of rows!
Let $T$ be a set of linearly independent columns of $D$; then $T$ is a linear PSR, as $D(:, t) = D(:, T) w_t$, i.e. $p(t|h) = p(T|h) w_t$.
Define $H$ to be a set of linearly independent rows of $D$; then $H$ forms a set of core histories.
Predictive State Representation - How to discover T [JS04]
Data: matrix $D$. Result: $T = \{t_1, t_2, \ldots, t_m\}$.
Initialisation: $v_0 = 1$; $T_0 = \{\varepsilon\}$; $H_0 = \{\varepsilon\}$; $i = 1$
while true do
    form the one-step extension of $D_{H,T}$:
        $D_{H',T'} = \begin{bmatrix} D_{H,T} & D_{H,T^+} \\ D_{H^+,T} & D_{H^+,T^+} \end{bmatrix}$
    $T_i \leftarrow$ linearly independent columns of $D_{H',T'}$
    $H_i \leftarrow$ linearly independent rows of $D_{H',T'}$
    $v_i \leftarrow$ rank of $D_{H',T'}$
    if $v_i == v_{i-1}$ then stop, else $i \leftarrow i + 1$
end
Caveats: $D$ cannot be computed exactly in general, hence sampling is needed. $T$ may not be complete, as the stopping condition $v_i == v_{i-1}$ is too strict.
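A sketch of this search in Python, under the idealised assumption that entries of $D$ can be queried exactly; `D_entry` and `extend` are hypothetical callables standing in for the model oracle and the one-step extension operation:

    import numpy as np

    def independent_indices(M, axis=0, tol=1e-7):
        """Greedily pick a maximal set of linearly independent
        rows (axis=0) or columns (axis=1) of M."""
        M = M if axis == 0 else M.T
        idx, rank = [], 0
        for i in range(M.shape[0]):
            if np.linalg.matrix_rank(M[idx + [i], :], tol=tol) > rank:
                idx.append(i)
                rank += 1
        return idx

    def discover_core_tests(D_entry, extend, max_iters=20):
        """Grow histories/tests one step at a time until the rank of
        the examined block of D stops increasing [JS04-style]."""
        H, T, v_prev = [''], [''], -1   # '' is the empty history/test
        for _ in range(max_iters):
            H_ext = H + [e for h in H for e in extend(h)]
            T_ext = T + [e for t in T for e in extend(t)]
            D = np.array([[D_entry(h, t) for t in T_ext] for h in H_ext])
            rows = independent_indices(D, axis=0)
            cols = independent_indices(D, axis=1)
            if len(cols) == v_prev:    # rank stopped growing (possibly too strict)
                return T
            H = [H_ext[i] for i in rows]
            T = [T_ext[j] for j in cols]
            v_prev = len(cols)
        return T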
Predictive State Representation - How to learn w_t in linear PSRs [JS04]
Recall:
$$p(a o t_i | h) = f_{aot_i}(p(T|h)) = p(T|h)^\top w_{aot_i}, \qquad p(a o | h) = f_{ao}(p(T|h)) = p(T|h)^\top w_{ao}$$
Note that $w_t$ does not depend on $h$; hence, stacking the core histories $H$,
$$p(a o t_i | H) = p(T|H) \, w_{aot_i}, \qquad p(a o | H) = p(T|H) \, w_{ao}$$
As $\dim(T) = \dim(H)$ and the rows of $p(T|H)$ are linearly independent,
$$w_{aot_i} = p(T|H)^{-1} p(a o t_i | H), \qquad w_{ao} = p(T|H)^{-1} p(a o | H)$$
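Given the core histories, each weight vector is the solution of an $m \times m$ linear system; a minimal sketch:

    import numpy as np

    def learn_weights(P_TH, p_target_H):
        """Solve p(. | H) = p(T|H) w for the weight vector w.

        P_TH:       matrix p(T|H), shape (m, m); row j is p(T | h_j)
        p_target_H: predictions of the extended test (a o t_i, or a o)
                    at each core history h_j, shape (m,)
        """
        # The rows of p(T|H) are linearly independent, so the solution is
        # unique; np.linalg.solve is preferred over forming the inverse.
        return np.linalg.solve(P_TH, p_target_H)

In practice $p(T|H)$ and the target predictions are estimated from sampled trajectories, so a least-squares solve (np.linalg.lstsq) would be the more robust choice.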
Predictive State Representation - Planning
PSRs can capture the regularities of the environment, represent the environment in a more compact way, and speed up the learning process.
An example from [RRST05]: [Figure: grid-world navigation task with goal state G]
Predictive State Representation - Learning rate [RRST05]
[Figure: steps to goal versus episodes, comparing learners using the environmental state, k-Markov representations (k = 1, ..., 6) and predictive representations (n = 4, 5, 6)]
Predictive State Representation - What's next?
For continuous observations: the Predictive Linear-Gaussian model [RS06] is to the PSR what the LDS is to the HMM.
iOOM/iPSR: can the set of core tests T grow while exploring the environment?
Summary
State space abstraction can provide state space generalisation, can speed up learning and planning in reinforcement learning, and is an alternative to value function approximation.
Predictive state representations are an alternative to POMDPs: they model systems using only observable quantities, with no hidden states, and are compact, but hard to learn.
References
[JS04] Michael R. James and Satinder Singh, Learning and discovery of predictive state representations in dynamical systems with reset, Proceedings of the 21st International Conference on Machine Learning, ACM, 2004.
[LSS02] Michael Littman, Richard Sutton, and Satinder Singh, Predictive representations of state, Advances in Neural Information Processing Systems 14 (NIPS), 2002.
[LWL06] Lihong Li, Thomas J. Walsh, and Michael L. Littman, Towards a unified theory of state abstraction for MDPs, ISAIM, 2006.
[RRST05] Eddie J. Rafols, Mark B. Ring, Richard S. Sutton, and Brian Tanner, Using predictive representations to improve generalization in reinforcement learning, IJCAI, 2005.
[RS06] Matthew Rudary and Satinder Singh, Predictive linear-Gaussian models of controlled stochastic dynamical systems, Proceedings of the 23rd International Conference on Machine Learning, ACM, 2006.
Thanks!