State Space Abstractions for Reinforcement Learning


Rowan McAllister and Thang Bui
MLG RCC, 6 November 2014

Outline

1. Introduction: Markov Decision Process, Reinforcement Learning, State Abstraction
2. Abstraction Types and Properties
3. Aggregation: Hard Aggregation, Soft Aggregation
4. Basis Functions for Linear Function Approximation
5. Probabilistic State Representations

Introduction: MDPs and Abstract MDPs

Markov Decision Process (MDP)

A Markov Decision Process is a tuple $\{S, A, R, T, \gamma\}$:

- $S$: set of states
- $A$: set of actions
- $R^a_s : S \times A \to \mathbb{R}$: reward function
- $T^a_{ss'} = P(s_{t+1} = s' \mid s_t = s, a_t = a)$: transition function
- $\gamma \in [0, 1]$: discount factor

Goal: find a policy $\pi : S \to A$ that maximises the expected return $\sum_{t=0}^{\infty} \gamma^t R^{\pi(s_t)}_{s_t}$.

Markov Decision Process: Policy Evaluation

$\pi$-evaluation:

\[
Q^\pi(s, a) = R^a_s + E_T\Big[\sum_{t=1}^{\infty} \gamma^t R^{\pi(s_t)}_{s_t}\Big]
            = R^a_s + \gamma E_T\Big[R^{\pi(s')}_{s'} + \sum_{t=1}^{\infty} \gamma^t R^{\pi(s_{t+1})}_{s_{t+1}}\Big]
\]

\[
Q^\pi(s, a) = R^a_s + \gamma \sum_{s'} \big[T^a_{ss'} \, Q^\pi(s', \pi(s'))\big]
\]

Bellman equation for optimal value:

\[
Q^*(s, a) = R^a_s + \gamma \sum_{s'} \big[T^a_{ss'} \max_{a'} Q^*(s', a')\big]
\]

\[
V^*(s) = \max_a Q^*(s, a) = \max_a \Big\{R^a_s + \gamma \sum_{s'} T^a_{ss'} V^*(s')\Big\}
\]
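The Bellman equation for the optimal value can be solved by repeated backups when R and T are known. The following is a minimal sketch, not taken from the slides: the function name `value_iteration`, the array layout (`R[s, a]`, `T[a, s, s']`) and the two-state MDP are all assumptions made for illustration.

```python
import numpy as np

def value_iteration(R, T, gamma, tol=1e-10, max_iter=10_000):
    """Iterate the Bellman optimality backup until V stops changing.

    R[s, a] is the reward R^a_s; T[a, s, s2] is T^a_{ss'} = P(s2 | s, a).
    Returns the state values V and the action values Q.
    """
    V = np.zeros(R.shape[0])
    for _ in range(max_iter):
        # Q(s, a) = R(s, a) + gamma * sum_{s'} T(a, s, s') V(s')
        Q = R + gamma * np.einsum("ast,t->sa", T, V)
        V_new = Q.max(axis=1)
        converged = np.max(np.abs(V_new - V)) < tol
        V = V_new
        if converged:
            break
    Q = R + gamma * np.einsum("ast,t->sa", T, V)
    return V, Q

# Hypothetical two-state MDP: action 0 stays put, action 1 moves to the other state.
T = np.array([[[1.0, 0.0], [0.0, 1.0]],
              [[0.0, 1.0], [1.0, 0.0]]])
R = np.array([[0.0, 0.0],
              [1.0, 0.0]])  # only "stay in state 1" is rewarded
V, Q = value_iteration(R, T, gamma=0.5)
policy = Q.argmax(axis=1)  # greedy policy with respect to Q
```

In this toy problem the values converge to V = [1, 2]: the greedy policy moves out of state 0 and then stays in state 1.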

Markov Decision Process: MDP optimal value example

[Figure: MDP example over coordinates X and Y, built up across panels for Rewards, Q (action values), π (policy), and V (state values).]

Reinforcement Learning

R, T unknown! Can implement backups with e.g. Q-learning. On experience $\{s_t, a_t, r_{t+1}, s_{t+1}\}$:

\[
\underbrace{Q(s_t, a_t)}_{\text{new estimate}} \leftarrow \underbrace{Q(s_t, a_t)}_{\text{old estimate}} + \underbrace{\alpha_t}_{\text{learning rate}} \underbrace{\big(r_{t+1} + \gamma \max_a Q(s_{t+1}, a) - Q(s_t, a_t)\big)}_{\text{Bellman error}}
\]
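The Q-learning backup above touches a single table entry per experience tuple. A minimal sketch, assuming a tabular Q stored as a NumPy array indexed `Q[s, a]`; the function name `q_learning_update` and the sample transition are illustrative, not from the slides.

```python
import numpy as np

def q_learning_update(Q, s, a, r, s_next, alpha, gamma):
    """Apply one tabular Q-learning backup in place; return the Bellman error."""
    bellman_error = r + gamma * np.max(Q[s_next]) - Q[s, a]
    Q[s, a] += alpha * bellman_error
    return bellman_error

# Hypothetical experience tuple (s_t=0, a_t=1, r_{t+1}=1.0, s_{t+1}=1).
Q = np.zeros((2, 2))
delta = q_learning_update(Q, s=0, a=1, r=1.0, s_next=1, alpha=0.5, gamma=0.9)
```

Starting from an all-zero table, the Bellman error is 1.0 and the updated entry `Q[0, 1]` becomes 0.5.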

Reinforcement Learning: Reinforcement learning drawbacks

Reinforcement learning with lookup tables for values, indexed by states, is not compact enough to scale to real-world problems.

Number of parameters to learn: EXP(dim[S]), i.e. exponential in the dimension of the state space.

Solutions?

- Function approximation (impose some functional form on V, to restrict expressivity).
- State abstraction (reduce your state space S before planning in it).

State Abstraction

Problem: my state inputs encode superfluous information!

Question: how do I discern relevant from irrelevant information automatically?

State Abstraction: Idea

Idea:

1. Start with ground representation S (relevant + irrelevant info).
2. Map S → abstract-feature space X (relevant info).
3. Solve MDP in X-space (easier).
4. Map solution X-space → S-space.

Example: a lane-changing driving task should not depend on the street name. Without abstraction, the agent might re-learn changing lanes on each street.

Good abstractions:

- Retain task-relevant information and ignore the rest.
- Represent similar situations similarly (discover generalisations).

State Abstraction: Abstract MDP [Li et al., 2006]

- Ground representation: $S$
- Abstract representation: $X$, where $|X| \ll |S|$
- Abstraction function: $\phi : S \to X$, with inverse $\phi^{-1} : X \to S^m$
- Cluster probability: $P(x \mid s) = \mathbb{1}[s \in \phi^{-1}(x)]$
- Abstract solution: $Q(x, a) = R^a_x + \gamma \sum_{x'} \big[T^a_{xx'} \max_{a'} Q(x', a')\big]$
- Ground solution: $Q(s, a) = E_{x \mid s}[Q(x, a)]$
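With a hard abstraction the cluster probability is an indicator, so the expectation in the ground solution reduces to a table lookup. A small sketch under assumed data: the mapping `phi` (five ground states, two abstract states) and the abstract Q-values are made up for illustration.

```python
import numpy as np

# Hypothetical abstraction function phi: ground state index -> abstract state index.
phi = np.array([0, 0, 1, 1, 1])
n_abstract = 2

# P(x | s) as one-hot rows: entry [s, x] is 1 exactly when phi(s) = x.
P_x_given_s = np.eye(n_abstract)[phi]

# Made-up abstract solution Q(x, a) for two actions.
Q_abstract = np.array([[1.0, 0.0],
                       [0.5, 2.0]])

# Ground solution Q(s, a) = E_{x|s}[Q(x, a)].
Q_ground = P_x_given_s @ Q_abstract
```

Every ground state inherits the Q-values of its cluster, which is the same as indexing `Q_abstract[phi]`.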

State Abstraction: Abstract MDP (soft) [Li et al., 2006]

- Ground representation: $S$
- Abstract representation: $X$, where $|X| \ll |S|$
- Abstraction function: $\phi : S \to X^n$, with inverse $\phi^{-1} : X \to S^m$
- Cluster probability: $P(x \mid s) \in [0, 1]$
- Abstract solution: $Q(x, a) = R^a_x + \gamma \sum_{x'} \big[T^a_{xx'} \max_{a'} Q(x', a')\big]$
- Ground solution: $Q(s, a) = E_{x \mid s}[Q(x, a)]$

Relation [Singh et al., 1995]:

\[
P(s \mid x) = \frac{P(x \mid s) \, P^\pi(s)}{P^\pi(x)}
\]

\[
R^a_x = \sum_s \big[P(s \mid x) \, R^a_s\big] = E_{s \mid x}[R^a_s]
\]

\[
T^a_{xx'} = \sum_{s, s'} \big[P(s \mid x) \, P(x' \mid s') \, T^a_{ss'}\big]
\]
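The relation of [Singh et al., 1995] turns a ground model into an abstract one with a few matrix products. The sketch below assumes arrays `P_x_given_s[s, x]`, `p_s[s]` (standing in for $P^\pi(s)$), `R[s, a]` and `T[a, s, s']`, and assumes every cluster has non-zero probability. The function name and the three-state example are illustrative only.

```python
import numpy as np

def abstract_model(P_x_given_s, p_s, R, T):
    """Build the abstract rewards R_x[x, a] and transitions T_x[a, x, x2]."""
    joint = P_x_given_s * p_s[:, None]   # P(x, s) = P(x | s) P(s), shape [s, x]
    p_x = joint.sum(axis=0)              # P(x), assumed > 0 for every cluster
    P_s_given_x = (joint / p_x).T        # Bayes rule, shape [x, s]
    R_x = P_s_given_x @ R                # R^a_x = sum_s P(s | x) R^a_s
    # T^a_{xx'} = sum_{s, s'} P(s | x) T^a_{ss'} P(x' | s')
    T_x = np.einsum("xs,ast,ty->axy", P_s_given_x, T, P_x_given_s)
    return R_x, T_x

# Hypothetical example: 3 ground states, 2 clusters, 1 action.
P_x_given_s = np.array([[1.0, 0.0],
                        [0.5, 0.5],
                        [0.0, 1.0]])
p_s = np.full(3, 1.0 / 3.0)
R = np.array([[0.0], [3.0], [6.0]])
T = np.array([[[0.0, 1.0, 0.0],
               [0.0, 0.0, 1.0],
               [1.0, 0.0, 0.0]]])   # deterministic cycle 0 -> 1 -> 2 -> 0
R_x, T_x = abstract_model(P_x_given_s, p_s, R, T)
```

Here the abstract rewards come out as 1 and 5, and each row of `T_x` still sums to one, so the abstract model is again a valid MDP.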

Abstractions: Types and Properties

Abstraction Types and Properties: Abstraction Objective

Bellman error:

\[
J(s) = V(s) - \max_a \Big\{R^a_s + \gamma \sum_{s'} T^a_{ss'} V(s')\Big\}
\]

Minimise: $J^2(s)$

Why?

- The true value $V^\pi$ is not available for comparison.
- The approximation error, $\|V^\pi - V\|$, is bounded by a constant multiple of the Bellman error norm [Bertsekas, 1995].
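The Bellman error can be evaluated for any candidate value function without knowing the true values. A minimal sketch, assuming the same array layout as elsewhere (`R[s, a]`, `T[a, s, s']`); the function name and the two-state MDP are hypothetical.

```python
import numpy as np

def bellman_error(V, R, T, gamma):
    """J(s) = V(s) - max_a { R[s, a] + gamma * sum_{s'} T[a, s, s'] V(s') }."""
    backup = (R + gamma * np.einsum("ast,t->sa", T, V)).max(axis=1)
    return V - backup

# Hypothetical two-state MDP: action 0 stays put, action 1 switches state.
T = np.array([[[1.0, 0.0], [0.0, 1.0]],
              [[0.0, 1.0], [1.0, 0.0]]])
R = np.array([[0.0, 0.0],
              [1.0, 0.0]])

J_zero = bellman_error(np.zeros(2), R, T, gamma=0.5)            # poor guess V = 0
J_opt = bellman_error(np.array([1.0, 2.0]), R, T, gamma=0.5)    # optimal values
objective = float(np.sum(J_zero ** 2))                          # summed squared error
```

The all-zero guess gives a summed squared Bellman error of 1, while the optimal values of this toy MDP give an error of zero in every state.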

Aggregation

Hard Aggregation: Adaptive Partitioning [Lee and Lau, 2004]

Discretise the state space with Voronoi cells automatically.

A Voronoi (or nearest-neighbour) quantiser maps states $s \in \mathbb{R}^n$ to a finite set of centroids $\{x_1, \ldots, x_m\}$, each also in $\mathbb{R}^n$:

\[
\phi(s) = \{x_i : \|s - x_i\| \le \|s - x_j\|, \ \forall j \ne i\}
\]

Desire:

- aggregate states of similar $Q^*$-values (Class $\phi_{Q^*}$-ish abstraction)
- the number of cells grows (or shrinks) as needed
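The nearest-neighbour quantiser is a one-line distance comparison. A minimal sketch, assuming Euclidean distance and returning the index of the winning centroid rather than the centroid itself; the function name `quantise` and the two centroids are illustrative.

```python
import numpy as np

def quantise(s, centroids):
    """Return the index i of the centroid x_i closest to state s."""
    distances = np.linalg.norm(centroids - s, axis=1)
    return int(np.argmin(distances))  # ties resolve to the lowest index

# Hypothetical codebook with two centroids in R^2.
centroids = np.array([[0.0, 0.0],
                      [1.0, 1.0]])
cell_a = quantise(np.array([0.2, 0.1]), centroids)
cell_b = quantise(np.array([0.9, 0.6]), centroids)
```

The first state falls in cell 0 and the second in cell 1. Adding or removing rows of `centroids` is all that is needed to grow or shrink the partition.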

Hard Aggregation: Adaptive Partitioning: Experiments

[Figure: two maze panels, each with START and GOAL, shown after different numbers of trials. Legend: barrier or obstacle; Voronoi cell boundary; codebook vector position and its number; sample trajectory; highest-valued action within each Voronoi cell; failed region.]

Hard Aggregation: Adaptive Partitioning: Algorithm (approx.)

    init s_t, a_t
    init X ← {s_t}
    REPEAT:
        execute a_t
        a_{t+1} ← ε-greedy policy on Q(x_{t+1})
        IF x_{t+1} == x_t:
            accreward += r_{t+1}
            IF accreward > threshold:
                X ← {X, s_{t+1}}        (split?)
        ELSE:
            accreward ← 0
        IF any neighbouring cells x_i, x_j have similar Qs:        (merge?)
            remove x_i and x_j from X and insert ½(x_i + x_j)
        backup Q
        t ← t + 1

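Soft Aggregation: Adaptive soft state aggregation

Adaptive State Aggregation [Singh et al., 1995]:

1. Compute $V(x \mid \theta)$ for all $x \in X$.
2. Compute cluster probabilities:
\[
P(x \mid s; \theta) = \frac{\exp[\theta(x, s)]}{\sum_{x'} \exp[\theta(x', s)]}
\]
3. Compute Bellman error + derivatives:
\[
J(s \mid \theta) = V(s \mid \theta) - \Big[R^\pi_s + \gamma \sum_{s'} T^\pi_{ss'} V(s' \mid \theta)\Big]
= E_{x \mid s; \theta}[V(x \mid \theta)] - \Big[R^\pi_s + \gamma \sum_{s'} T^\pi_{ss'} E_{x' \mid s'; \theta}[V(x' \mid \theta)]\Big]
\]
\[
\frac{\partial J^2(\theta)}{\partial \theta(x, s)} = 2 J(s \mid \theta) \, P(x \mid s; \theta) \, (1 - \gamma T^\pi_{ss}) \, \big(V(x \mid \theta) - V(s \mid \theta)\big)
\]
4. $\theta \leftarrow \theta - \alpha \, \dfrac{\partial J^2(\theta)}{\partial \theta}$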
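Steps 2 and 3 can be sketched with a softmax over clusters followed by the Bellman error and its gradient as written on the slide. This is an illustrative sketch only: the function names, the array layout (`theta[x, s]`, `V_x[x]`, `R_pi[s]`, `T_pi[s, s']`) and the toy numbers are assumptions.

```python
import numpy as np

def cluster_probabilities(theta):
    """Softmax over clusters x for each state s; theta and result have shape [x, s]."""
    e = np.exp(theta - theta.max(axis=0, keepdims=True))  # shift for numerical stability
    return e / e.sum(axis=0, keepdims=True)

def asa_error_and_gradient(theta, V_x, R_pi, T_pi, gamma):
    """Return the Bellman error J[s] and the gradient of J^2 with respect to theta[x, s]."""
    P = cluster_probabilities(theta)
    V_s = V_x @ P                                   # V(s) = E_{x|s}[V(x)]
    J = V_s - (R_pi + gamma * T_pi @ V_s)
    self_loop = 1.0 - gamma * np.diag(T_pi)         # (1 - gamma * T_ss)
    grad = 2.0 * J[None, :] * P * self_loop[None, :] * (V_x[:, None] - V_s[None, :])
    return J, grad

# Hypothetical problem: 2 clusters, 3 ground states on a deterministic cycle.
theta = np.zeros((2, 3))
V_x = np.array([0.0, 2.0])
R_pi = np.full(3, 0.5)
T_pi = np.array([[0.0, 1.0, 0.0],
                 [0.0, 0.0, 1.0],
                 [1.0, 0.0, 0.0]])
J, grad = asa_error_and_gradient(theta, V_x, R_pi, T_pi, gamma=0.5)
theta_next = theta - 0.1 * grad   # step 4 with an assumed learning rate of 0.1
```

With these numbers every state has V(s) = 1, which already satisfies the Bellman equation, so both the error and the gradient are zero and the update leaves theta unchanged.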

Soft Aggregation: Adaptive soft state aggregation: Results

[Figure: summed squared Bellman error against iterations of ASA, with one curve per number of clusters.]

Basis Functions for Linear Function Approximation

Basis Functions for Linear Function Approximation: Idea

Idea: abstract $S$ and approximate $V(\cdot)$.

Want: an easy-to-train form for $V(\cdot)$, namely a linear function.

Method:

1. Compute approximate $V(x) = \beta^\top x$ (use LSTD).
2. Compute Bellman errors $J(s)$.
3. Learn abstract mapping $x = \phi_\theta(s)$ parameterised by $\theta$ (use Bellman error to guide).
4. (Optional) aggregate $x$.

Choices for the functional form of $\phi$?

- Neighbourhood Component Analysis (NCA) [Keller et al., 2006]
- Radial Basis Functions (RBF) [Menache et al., 2005]

Basis Functions for Linear Function Approximation: Neighbourhood Component Analysis

Idea: learn a distance metric that optimises nearest-neighbour leave-one-out performance.

\[
\begin{aligned}
D(s_i, s_j) &= (s_i - s_j)^\top \Sigma \, (s_i - s_j) \\
            &= (s_i - s_j)^\top A^\top A \, (s_i - s_j) \\
            &= (A s_i - A s_j)^\top (A s_i - A s_j) \\
            &= (x_i - x_j)^\top (x_i - x_j)
\end{aligned}
\]

where $s \in \mathbb{R}^n$, $A \in \mathbb{R}^{d \times n}$, $d \ll n$.

$\text{class}(s_i) = \text{class}(s_j)$ with probability

\[
p_{ij} = \frac{\exp(-\|A s_i - A s_j\|)}{\sum_{k \ne i} \exp(-\|A s_i - A s_k\|)}
\]

Objective of [Keller et al., 2006] for gradient descent:

\[
f(A) = \sum_{ij} p_{ij} \, [J(s_i) - J(s_j)]
\]
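The neighbour probabilities $p_{ij}$ are a row-wise softmax over negative distances in the projected space. The sketch below is illustrative and makes one explicit assumption: it puts the squared distance $D(s_i, s_j)$ in the exponent, as in standard NCA, which may differ from the exact exponent on the slide. The function name and the three sample states are hypothetical.

```python
import numpy as np

def nca_probabilities(S, A):
    """Return P[i, j] = p_ij for states S (one row each) and projection matrix A."""
    X = S @ A.T                                   # x_i = A s_i
    diff = X[:, None, :] - X[None, :, :]
    D = np.sum(diff ** 2, axis=-1)                # squared distances (assumption, see lead-in)
    logits = -D
    np.fill_diagonal(logits, -np.inf)             # a point is never its own neighbour (k != i)
    logits = logits - logits.max(axis=1, keepdims=True)  # shift for numerical stability
    P = np.exp(logits)
    return P / P.sum(axis=1, keepdims=True)

# Hypothetical 1-D states with an identity projection.
S = np.array([[0.0], [1.0], [10.0]])
A = np.array([[1.0]])
P = nca_probabilities(S, A)
```

Each row of `P` sums to one, the diagonal is zero, and the state at 0 assigns almost all its probability to the nearby state at 1 rather than the distant state at 10.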

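Basis Functions for Linear Function Approximation: Least-Squares Temporal Difference learning (LSTD)

Goal: find $\beta$ that minimises the mean squared (projected) Bellman error of the linear approximation $V^\pi(x) \approx \beta^\top x$ [Boyan, 1999].

We observe $\{x_1, r_1, \ldots, x_t, r_t\}$. For each $\tau \in [1, t]$ we can generate Bellman errors

\[
j_\tau = r_{\tau+1} + (\gamma x_{\tau+1} - x_\tau)^\top \beta
\]

TD update: $\beta \leftarrow \beta + \alpha z_\tau j_\tau$

\[
b_t = \sum_{\tau=1}^{t} z_\tau r_\tau, \qquad
A_t = \sum_{\tau=1}^{t} z_\tau (x_\tau - \gamma x_{\tau+1})^\top, \qquad
z_t = \sum_{\tau=1}^{t} \gamma^{t-\tau} x_\tau \ \text{(eligibility vector)}
\]

\[
\beta_t = A_t^{-1} b_t
\]

A good survey of alternatives to MSPBE is [Dann et al., 2014].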
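The LSTD estimate accumulates A and b along a trajectory and then solves one linear system. A minimal sketch under stated assumptions: experience is given as explicit `(x, r, x_next)` transitions from a single episode, the terminal state has an all-zero feature vector, and the eligibility vector is updated recursively as `z = gamma * z + x`, matching the sum on the slide. The function name `lstd` and the toy chain are illustrative.

```python
import numpy as np

def lstd(transitions, n_features, gamma):
    """Solve A beta = b, with A and b accumulated over (x, r, x_next) transitions."""
    A = np.zeros((n_features, n_features))
    b = np.zeros(n_features)
    z = np.zeros(n_features)                     # eligibility vector
    for x, r, x_next in transitions:
        z = gamma * z + x                        # z_t = sum_tau gamma^(t - tau) x_tau
        A += np.outer(z, x - gamma * x_next)
        b += z * r
    return np.linalg.solve(A, b)

# Hypothetical chain s0 -> s1 -> terminal with one-hot features and rewards 1, then 2.
x0 = np.array([1.0, 0.0])
x1 = np.array([0.0, 1.0])
terminal = np.zeros(2)
beta = lstd([(x0, 1.0, x1), (x1, 2.0, terminal)], n_features=2, gamma=0.5)
```

With one-hot features, `beta` holds the state values directly. Here both come out as 2, matching V(s1) = 2 and V(s0) = 1 + 0.5 × 2.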

Probabilistic State Representations

References

Bertsekas, D. P. (1995). Dynamic Programming and Optimal Control, volume 1. Athena Scientific, Belmont, MA.
Boyan, J. A. (1999). Least-squares temporal difference learning. In ICML.
Dann, C., Neumann, G., and Peters, J. (2014). Policy evaluation with temporal differences: A survey and comparison. Journal of Machine Learning Research, 15.
Keller, P. W., Mannor, S., and Precup, D. (2006). Automatic basis function construction for approximate dynamic programming and reinforcement learning. In Proceedings of the 23rd International Conference on Machine Learning. ACM.
Lee, I. S. and Lau, H. Y. (2004). Adaptive state space partitioning for reinforcement learning. Engineering Applications of Artificial Intelligence, 17(6).
Li, L., Walsh, T. J., and Littman, M. L. (2006). Towards a unified theory of state abstraction for MDPs. In ISAIM.
Menache, I., Mannor, S., and Shimkin, N. (2005). Basis function adaptation in temporal difference reinforcement learning. Annals of Operations Research, 134(1).
Singh, S. P., Jaakkola, T., and Jordan, M. I. (1995). Reinforcement learning with soft state aggregation. Advances in Neural Information Processing Systems.

Radial Basis Functions

\[
\phi_{\theta_i}(s) = \exp\Big(-\tfrac{1}{2} (s - c_i)^\top W_i \, (s - c_i)\Big)
\]
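A single radial basis feature can be evaluated directly from its centre and width matrix. A small sketch, assuming $W_i$ is a symmetric positive semi-definite matrix used in the exponent exactly as the formula writes it; the function name `rbf_feature` and the sample values are illustrative.

```python
import math
import numpy as np

def rbf_feature(s, c, W):
    """Evaluate exp(-0.5 * (s - c)^T W (s - c)) for state s, centre c, width matrix W."""
    d = s - c
    return float(np.exp(-0.5 * d @ W @ d))

# Hypothetical centre at the origin with an identity width matrix.
c = np.zeros(2)
W = np.eye(2)
at_centre = rbf_feature(c, c, W)
one_away = rbf_feature(np.array([1.0, 0.0]), c, W)
```

The feature equals 1 at the centre and decays to exp(-0.5) at unit distance under the identity width matrix. Adapting $c_i$ and $W_i$ moves and reshapes the bump.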

State Space Abstraction for Reinforcement Learning
Rowan McAllister & Thang Bui, MLG, Cambridge, Nov 6th, 2014

Rowan's introduction

Types of abstraction [LWL06]: Abstractions are partitioned based