Time Indexed Hierarchical Relative Entropy Policy Search


1 Time Indexed Hierarchical Relative Entropy Policy Search
Florentin Mehlbeer
June 19

2 Structure
- Introduction
- Reinforcement Learning
- Relative Entropy Policy Search
- Hierarchical Relative Entropy Policy Search
- Time Indexed Hierarchical Relative Entropy Policy Search
- Evaluation
- Conclusion
- References

3 Introduction
Example: Texas Hold'em Poker
- Every player initially gets 2 (pocket) cards
- In 3 steps, 5 (community) cards are shown: Flop (3), Turn (1), River (1)
- Betting rounds take place between the stages
- The remaining player with the best cards (pocket cards + community cards) wins the chips
- Question: What is the optimal strategy?

4 Reinforcement Learning
Problem Statement
- Given a set of states $S = \{s_1, s_2, \dots, s_n\}$ and a set of actions $A = \{a_1, a_2, \dots, a_m\}$, find a policy $\pi(a \mid s) : S \times A \to [0, 1]$ maximizing the expected return $J(\pi)$
- A reward $R_s^a$ is collected for every state transition
- Transition probabilities $P_{st}^a$

5 Reinforcement Learning
Policy Iteration of Actor-Critic Methods
- The state-value function $V^\pi : S \to \mathbb{R}$ yields the (approximate) expected future return for every state $s \in S$
- Therefore $V^{\pi_1}(s) \ge V^{\pi_2}(s) \;\; \forall s \in S$ iff $\pi_1$ is at least as good as $\pi_2$

while not converged
    1. Policy Evaluation
       Estimate the current policy $\pi$ by calculating its state-value function $V^\pi$
    2. Policy Improvement
       - Generate samples by executing the current policy $\pi$ and observe the rewards
       - Compute the error (critic)
       - Adjust the policy's probabilities accordingly
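The evaluation/improvement loop above can be sketched on a tiny tabular problem. This is a minimal sketch under stated assumptions: the 2-state, 2-action MDP, the discount factor `GAMMA`, the known model, and the greedy improvement step are all illustrative and not from the slides. An actor-critic method would instead estimate the value function from samples and adjust the action probabilities gradually.

```python
# Hypothetical 2-state, 2-action MDP (all numbers are illustrative assumptions).
# P[s][a] is a list of (next_state, probability); R[s][a] is the reward.
P = {
    0: {0: [(0, 1.0)], 1: [(1, 1.0)]},
    1: {0: [(0, 1.0)], 1: [(1, 1.0)]},
}
R = {0: {0: 0.0, 1: 1.0}, 1: {0: 0.0, 1: 2.0}}
GAMMA = 0.9  # discount factor (assumption; the slides do not specify one)


def q_value(s, a, V):
    """One-step lookahead: reward plus discounted value of the successor states."""
    return R[s][a] + GAMMA * sum(p * V[t] for t, p in P[s][a])


def evaluate(policy, tol=1e-10):
    """Policy evaluation: iterate the Bellman equation for a deterministic policy."""
    V = {s: 0.0 for s in P}
    while True:
        change = 0.0
        for s in P:
            v_new = q_value(s, policy[s], V)
            change = max(change, abs(v_new - V[s]))
            V[s] = v_new
        if change < tol:
            return V


def improve(V):
    """Policy improvement: pick the action with the highest lookahead value."""
    return {s: max(P[s], key=lambda a: q_value(s, a, V)) for s in P}


def policy_iteration():
    policy = {s: 0 for s in P}
    while True:
        V = evaluate(policy)
        new_policy = improve(V)
        if new_policy == policy:  # converged: improvement changes nothing
            return policy, V
        policy = new_policy
```

Calling `policy_iteration()` returns the final policy together with its value function; the loop stops as soon as the improvement step leaves the policy unchanged.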

6 Relative Entropy Policy Search
Problem Statement
- Maximize the expected return
  $$\max_p J(\pi) = \max_p \sum_{s \in S,\, a \in A} \underbrace{\mu(s)\, \pi(a \mid s)}_{p(s,a)} R_s^a$$
  so that in every iteration
  $$D(p \,\|\, q) = \sum_{s \in S,\, a \in A} p(s,a) \log \frac{p(s,a)}{q(s,a)} \le \epsilon$$
- Analytical solution yields
  $$p(s,a) = \frac{q(s,a) \exp\left(\frac{1}{\eta}\, \delta(s,a)\right)}{\sum_{b \in A} q(s,b) \exp\left(\frac{1}{\eta}\, \delta(s,b)\right)}$$
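The closed-form solution above can be tried out numerically for a single state. This is a minimal sketch, assuming hand-picked values for `q`, `delta`, and `eta` (in REPS, η and the value function come out of the optimization problem, which is not solved here). `q` is taken as the old distribution over the actions of one state, normalized for the illustration.

```python
import math


def reps_update(q, delta, eta):
    """Reweight q(s, .) by exp(delta / eta) and normalize over the actions b."""
    # Subtracting the maximum avoids overflow; the shift cancels in the ratio.
    d_max = max(delta)
    weights = [q_b * math.exp((d_b - d_max) / eta) for q_b, d_b in zip(q, delta)]
    z = sum(weights)
    return [w / z for w in weights]


def kl_divergence(p, q):
    """D(p || q) = sum_b p_b log(p_b / q_b)."""
    return sum(p_b * math.log(p_b / q_b) for p_b, q_b in zip(p, q) if p_b > 0.0)


# Hypothetical numbers for one state with three actions.
q = [0.25, 0.25, 0.5]
delta = [1.0, 0.0, -1.0]
for eta in (0.5, 1.0, 5.0):
    p = reps_update(q, delta, eta)
    print(eta, [round(x, 3) for x in p], round(kl_divergence(p, q), 4))
```

A large η keeps the new distribution close to `q` (small KL divergence), while a small η shifts probability mass aggressively towards actions with a high δ. This is how η controls the step size bounded by ε.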

7 Relative Entropy Policy Search
Algorithm

while not converged do
    Obtain $N$ samples $(s_i, a_i, t_i, r_i)$ using the current policy $\pi_k$
    for $i = 1, \dots, N$ do
        $\delta(s_i, a_i) \leftarrow \delta(s_i, a_i) + [r_i + V(t_i) - V(s_i)]$
    end for
    $(\eta, V) \leftarrow$ Solve optimization problem
    $\pi_{k+1}(a \mid s) = \frac{p(s,a)}{\sum_{b \in A} p(s,b)}$
end while
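One pass through the inner part of this loop can be sketched as follows. This is a minimal sketch under stated assumptions: `V` and `eta` are hand-picked instead of being obtained from the optimization problem, the sample data is hypothetical, and the old policy stands in for q(s, a), which is valid for the policy update because the state distribution cancels in the per-state normalization.

```python
import math
from collections import defaultdict


def accumulate_bellman_errors(samples, V):
    """Sum r + V(t) - V(s) over all samples of each (s, a) pair."""
    delta = defaultdict(float)
    for s, a, t, r in samples:
        delta[(s, a)] += r + V[t] - V[s]
    return delta


def update_policy(policy, delta, eta):
    """pi_{k+1}(a|s) = p(s,a) / sum_b p(s,b), with p(s,a) = pi_k(a|s) exp(delta(s,a) / eta)."""
    new_policy = {}
    for s, action_probs in policy.items():
        weights = {
            a: prob * math.exp(delta.get((s, a), 0.0) / eta)
            for a, prob in action_probs.items()
        }
        z = sum(weights.values())
        new_policy[s] = {a: w / z for a, w in weights.items()}
    return new_policy


# Hypothetical data: two states, two actions, hand-picked V and eta.
policy = {
    "s0": {"left": 0.5, "right": 0.5},
    "s1": {"left": 0.5, "right": 0.5},
}
V = {"s0": 0.0, "s1": 1.0}
samples = [
    ("s0", "right", "s1", 1.0),
    ("s0", "left", "s0", 0.0),
    ("s1", "left", "s0", 0.0),
]
delta = accumulate_bellman_errors(samples, V)
new_policy = update_policy(policy, delta, eta=1.0)
print(new_policy)
```

Actions with a positive accumulated Bellman error gain probability and actions with a negative one lose it. State-action pairs that were never sampled keep δ = 0 in this sketch.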

8 Hierarchical Relative Entropy Policy Search
Idea
- Goal: Versatile solutions with a hierarchical structure
- Introduce high-level actions called options, $O = \{o_1, o_2, \dots, o_n\}$
- Option = sequence of actions
- Execute 1 option per episode
- 2 policies are needed
  - Supervisory policy: $\pi(o \mid s)$
  - Sub-policy: $\pi(a \mid o, s)$

9 Hierarchical Relative Entropy Policy Search
Approach
- Goal: Determine $p(s, a, o)$
- Problem: Only the marginals $q(s, a)$ can be sampled
- Idea: Treat both policies as one mixture-of-options policy
  $$\pi(a \mid s) = \sum_{o \in O} \pi(o \mid s)\, \pi(a \mid o, s)$$
  and compute the responsibilities $p(o \mid s, a)$, with
  $$p(o \mid s, a) = q(o \mid s, a)$$
- Additional constraint: Bound the entropy of the responsibilities
  $$-\sum_{s \in S,\, a \in A} p(s,a) \sum_{o \in O} p(o \mid s,a) \log p(o \mid s,a) \le \kappa$$
- Analytical solution yields
  $$p(s,a,o) = \frac{q(s,a)\, p(o \mid s,a)^{1 + \eta/\xi} \exp\left(\frac{1}{\eta}\, \delta(s,a)\right)}{\sum_{b \in A} q(s,b)\, p(o \mid s,b)^{1 + \eta/\xi} \exp\left(\frac{1}{\eta}\, \delta(s,b)\right)}$$
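The mixture-of-options policy and the responsibilities can be illustrated for a single state. This is a minimal sketch, assuming the responsibilities are obtained with Bayes' rule from the two policies (the slide only states that they are computed) and using hypothetical option and action names and probabilities.

```python
import math


def mixture_policy(pi_o, pi_a_given_o):
    """pi(a|s) = sum_o pi(o|s) * pi(a|o,s) for one fixed state s."""
    actions = next(iter(pi_a_given_o.values())).keys()
    return {a: sum(pi_o[o] * pi_a_given_o[o][a] for o in pi_o) for a in actions}


def responsibilities(pi_o, pi_a_given_o, a):
    """p(o|s,a) via Bayes' rule: pi(o|s) pi(a|o,s) / pi(a|s)."""
    joint = {o: pi_o[o] * pi_a_given_o[o][a] for o in pi_o}
    z = sum(joint.values())
    return {o: j / z for o, j in joint.items()}


def entropy(dist):
    """Shannon entropy -sum_o p(o) log p(o)."""
    return -sum(p * math.log(p) for p in dist.values() if p > 0.0)


# Hypothetical single-state example with two options and two actions.
pi_o = {"o1": 0.5, "o2": 0.5}
overlapping = {"o1": {"a1": 0.5, "a2": 0.5}, "o2": {"a1": 0.5, "a2": 0.5}}
separated = {"o1": {"a1": 0.9, "a2": 0.1}, "o2": {"a1": 0.1, "a2": 0.9}}

for name, sub_policy in (("overlapping", overlapping), ("separated", separated)):
    resp = responsibilities(pi_o, sub_policy, "a1")
    print(name, mixture_policy(pi_o, sub_policy), resp, entropy(resp))
```

When the options overlap, every action is equally likely under each option and the responsibilities have maximal entropy. When the options are well separated, the entropy is lower, which is the situation the bound κ encourages.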

10 Hierarchical Relative Entropy Policy Search
Algorithm

while not converged do
    Obtain $N$ samples $(s_i, a_i, t_i, r_i)$ using the current policy $\pi_k$
    for $i = 1, \dots, N$ do
        $\delta(s_i, a_i) \leftarrow \delta(s_i, a_i) + [r_i + V(t_i) - V(s_i)]$
    end for
    $(\eta, \xi, V) \leftarrow$ Solve optimization problem
    $\pi_{k+1}(o \mid s) = \frac{\sum_{a \in A} p(s,a,o)}{\sum_{t \in S,\, a \in A} p(t,a,o)}$
    $\pi_{k+1}(a \mid o, s) = \frac{p(s,a,o)}{\sum_{b \in A} p(s,b,o)}$
end while

11 Time Indexed Hierarchical Relative Entropy Policy Search
Idea and approach
Idea
- Sequences of $L$ options to reach a certain goal
- Execute 1 sequence per episode
- Each option takes 1 of the $L$ time steps
Approach
- Expected return
  $$J(\pi) = \sum_{s \in S} \mu_{L+1}(s)\, r(s) + \sum_{l=1}^{L} \sum_{s \in S,\, a \in A} \mu_l(s)\, \pi_l(a \mid s)\, R_s^a$$
- The known constraints apply for each time step $l$
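The time-indexed expected return can be evaluated on a small example by pushing the state distribution forward through the L time steps. This is a minimal sketch under stated assumptions: the transition model `P`, the propagation rule μ_{l+1}(t) = Σ μ_l(s) π_l(a|s) P(t|s,a), the per-step reward table `R[s][a]`, and the two-state example are all illustrative and not taken from the slides.

```python
def expected_return(mu1, policies, P, R, r_final):
    """J = sum_s mu_{L+1}(s) r(s) + sum_{l=1}^{L} sum_{s,a} mu_l(s) pi_l(a|s) R[s][a]."""
    mu = dict(mu1)
    total = 0.0
    for pi_l in policies:  # one policy per time step l = 1..L
        mu_next = {s: 0.0 for s in mu}
        for s, mass in mu.items():
            for a, prob in pi_l[s].items():
                total += mass * prob * R[s][a]
                # Propagate the probability mass to the successor states.
                for t, p_t in P[s][a].items():
                    mu_next[t] += mass * prob * p_t
        mu = mu_next
    # After L steps, mu holds mu_{L+1}; add the final reward term.
    return total + sum(mu[s] * r_final[s] for s in mu)


# Hypothetical problem: moving from A to B costs 1, ending in B pays 10.
STATES = ("A", "B")
P = {
    "A": {"stay": {"A": 1.0}, "go": {"B": 1.0}},
    "B": {"stay": {"B": 1.0}, "go": {"A": 1.0}},
}
R = {s: {"stay": 0.0, "go": -1.0} for s in STATES}
r_final = {"A": 0.0, "B": 10.0}
mu1 = {"A": 1.0, "B": 0.0}

always_stay = {s: {"stay": 1.0} for s in STATES}
always_go = {s: {"go": 1.0} for s in STATES}
coin_flip = {s: {"stay": 0.5, "go": 0.5} for s in STATES}

print(expected_return(mu1, [always_go, always_stay], P, R, r_final))
print(expected_return(mu1, [always_stay, always_stay], P, R, r_final))
print(expected_return(mu1, [coin_flip, always_stay], P, R, r_final))
```

Because each time step has its own policy, the sequence "go, then stay" can be represented directly, which a single time-independent policy could not express in this example.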

12 Time Indexed Hierarchical Relative Entropy Policy Search
Algorithm

while not converged do
    for $l = 1, \dots, L$ do
        Obtain $N$ samples $(s_{l,i}, a_{l,i}, t_{l,i}, r_{l,i})$ using the current policy $\pi_{k,l}$
        for $i = 1, \dots, N$ do
            $\delta_l(s_{l,i}, a_{l,i}) \leftarrow \delta_l(s_{l,i}, a_{l,i}) + [r_{l,i} + V(t_{l,i}) - V(s_{l,i})]$
        end for
    end for
    $(\eta, \xi, V) \leftarrow$ Solve optimization problem
    for $l = 1, \dots, L$ do
        $\pi_{k+1,l}(o \mid s) = \frac{\sum_{a \in A} p_l(s,a,o)}{\sum_{t \in S,\, a \in A} p_l(t,a,o)}$
        $\pi_{k+1,l}(a \mid o, s) = \frac{p_l(s,a,o)}{\sum_{b \in A} p_l(s,b,o)}$
    end for
end while

13 Evaluation

14 Conclusion
- Reinforcement learning: iteratively executing and improving a policy
- REPS solves the RL problem while bounding the KL divergence between two subsequent state-action distributions
- HiREPS extends REPS by introducing options
- Time Indexed HiREPS: sequencing of options
- Applications in sequencing motor tasks

15 References
[1] R. Sutton, A. Barto; Reinforcement Learning: An Introduction; 2005
[2] J. Peters, K. Mülling, Y. Altün; Relative Entropy Policy Search; 2010
[3] C. Daniel, G. Neumann, J. Peters; Hierarchical Relative Entropy Policy Search; 2012
[4] C. Daniel, G. Neumann, O. Kroemer, J. Peters; Learning Sequential Motor Tasks; 2013
[5] R. Sutton, D. Precup, S. Singh; Between MDPs and Semi-MDPs: A Framework for Temporal Abstraction in Reinforcement Learning
