Reinforcement Learning in Partially Observable Multiagent Settings: Monte Carlo Exploring Policies


1 Reinforcement Learning in Partially Observable Multiagent Settings: Monte Carlo Exploring Policies Presenter: Roi Ceren THINC Lab, University of Georgia Prashant Doshi THINC Lab, University of Georgia Bikramjit Banerjee University of Southern Mississippi

2 Introduction Model-free reinforcement learning in multiagent systems is a nascent field Monte Carlo Exploring Starts for POMDPs (MCES-P) is a powerful single-agent RL technique: policy iteration that leverages Q-learning to hill-climb through the local policy space to a local optimum Allows PAC bounds to select the sample complexity with a desired confidence

3 Introduction We extend MCES-P to the non-cooperative multiagent setting and introduce MCES for Interactive POMDPs (MCES-IP) Explicitly models the opponent Predicates action-values on expected opponent behavior When instantiated with PAC, trades off the computational expense of modeling against a lower sample complexity bound We additionally provide a policy-space pruning mechanism to promote scalability Parametrically bounds the regret from avoiding policies Prioritizes eliminating low-regret policy transformations

4 Background: Multiagent Decision Process In the multiagent setting, all agents affect the state and the reward of every agent [Figure: agents i and j each act on the shared physical state and each receives a reward R(s, a_i, a_j) that depends on the joint action]

5 Background: I-POMDP The Interactive POMDP (I-POMDP) (Gmytrasiewicz and Doshi 2005): <IS, A, T, Ω, O, R> Non-cooperative: agents get individual, potentially competitive rewards Actions A, state transitions T, observations Ω, observation probabilities O, and rewards R IS: interactive state, combining the physical state and a model of the other agent Significant uncertainty: the agent must reason not only about the physical state, but also about the opponent's motivations and beliefs

6 Background: MCES-P Template Monte Carlo Exploring Starts for POMDPs (MCES- P) (Perkins - AAAI 2002) General template Explore neighborhood of π - all policies that differ by a single action a on some observation sequence o Compute expected value by simulating policies online Hill climb to policies with better values Terminate if no neighbor is better than the current policy
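
A minimal sketch of this template in Python. The policy representation (a dict from observation sequences to actions), the evaluate() simulator hook, the fixed per-policy sample count k, and the best-neighbor selection rule are illustrative assumptions, not the paper's implementation; a concrete neighbors() helper is sketched after slide 10.

```python
from typing import Callable, Dict, List, Tuple

Policy = Dict[Tuple[str, ...], str]   # observation sequence -> action


def mces_p(pi: Policy,
           neighbors: Callable[[Policy], List[Policy]],
           evaluate: Callable[[Policy], float],   # simulate one episode, return its total reward
           k: int,
           eps: float) -> Policy:
    """Hill-climb through the local policy space until no neighbor is better."""
    while True:
        # Estimate the value of the current policy and of every neighbor from k episodes each.
        q_pi = sum(evaluate(pi) for _ in range(k)) / k
        best, best_q = None, float("-inf")
        for nb in neighbors(pi):
            q_nb = sum(evaluate(nb) for _ in range(k)) / k
            if q_nb > q_pi + eps and q_nb > best_q:   # transform only on a clear improvement
                best, best_q = nb, q_nb
        if best is None:       # no neighbor beats pi by more than eps: local optimum
            return pi
        pi = best              # hill-climb and repeat
```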

7 Background: MCES-P Template Transformation Pick a random observation sequence and replace its action with a random action, e.g. {o1,o2}: a1 → a3 [Figure: the policy tree with root action a2, before and after changing the action at observation sequence {o1,o2} from a1 to a3]

8 Background: MCES-P Template Transformation Pick a random observation sequence and replace its action with a random action: π and π' differ by the single transformation {o1,o2}: a1 ↔ a3

9 Background: MCES-P Template Transformation Pick a random observation sequence and replace its action with a random action; the same rule generates every neighbor of π, e.g. o1: a1 ↔ a2, o1: a1 ↔ a3, {o1,o2}: a1 ↔ a2, {o1,o2}: a1 ↔ a3, and at the empty sequence (the root action) a2 ↔ a1, a2 ↔ a3

10 Background: MCES-P Template Transformation Local Neighborhood
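
A small companion sketch of the local neighborhood itself, under the same assumed dict representation: every policy that differs from π by exactly one action at one observation sequence. The example policy below only loosely mimics the tree on slide 7.

```python
from functools import partial
from typing import Dict, List, Tuple

Policy = Dict[Tuple[str, ...], str]


def neighbors(pi: Policy, actions: List[str]) -> List[Policy]:
    """All single-action transformations of pi."""
    result = []
    for obs_seq, current in pi.items():
        for a in actions:
            if a != current:
                nb = dict(pi)      # copy, then change exactly one action
                nb[obs_seq] = a
                result.append(nb)
    return result


# Example: root action a2, plus actions for the length-1 observation sequences.
pi = {(): "a2", ("o1",): "a1", ("o2",): "a3"}
print(len(neighbors(pi, ["a1", "a2", "a3"])))            # 3 sequences x 2 alternatives = 6

neighbors_fn = partial(neighbors, actions=["a1", "a2", "a3"])   # plugs into the loop on slide 6
```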

11 Background: MCES-P Template Sampling Pick a random action (e.g. a3), simulate a trajectory τ, and update:
Q_π^{ō,a} ← (1 − α(m, c_π^{ō,a})) Q_π^{ō,a} + α(m, c_π^{ō,a}) R_post-ō(τ)

12 Background: MCES-P Template Sampling Sample the neighborhood k times for each policy; transform from π to π' when Q_π' > Q_π + ε


15 Background: MCES-P Template Termination When all neighbors have been sampled k times and no neighbor is better

16 Background: MCESP+PAC Problem: choosing a good sample bound k Low values of k increase the chance we make errors when transforming High values, though requiring more samples, guarantee we hill-climb correctly [Figure: spectrum from high error probability / inaccurate Q-values (low k) to low error probability / accurate Q-values (high k)]

17 Background: MCESP+PAC Solution: pick a k that guarantees some confidence in the accuracy of the Q-value Probably Approximately Correct (PAC) learning: the probability of the sample average deviating from the true mean by more than ε is bounded by the error δ:
Pr( |X̄ − μ| > ε ) ≤ 2 exp( −2k ε² / Λ² ) = δ

18 Background: MCESP+PAC With ε and δ, we calculate the number of samples required to satisfy the error bound, where m is the number of transformations made so far and N is the number of neighbor policies:
δ_m = 6δ / (π² m²)
k_m = ⌈ (2 Λ(π)² / ε²) ln(2N / δ_m) ⌉
Λ(π, π') = max(Q_π − Q_π') − min(Q_π − Q_π') ≤ 2T (R_max − R_min)
Λ(π) = max_{π' ∈ Nbhd(π)} Λ(π, π')
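
A direct transcription of the bound above into Python, under the reconstruction shown (in particular the δ_m allotment); the numbers plugged in at the end are made up purely for illustration.

```python
import math


def sample_bound(lam: float, eps: float, delta: float, n_neighbors: int, m: int) -> int:
    """Per-neighbor sample count k_m for the m-th transformation."""
    delta_m = 6.0 * delta / (math.pi ** 2 * m ** 2)          # per-transformation error budget
    k_m = (2.0 * lam ** 2 / eps ** 2) * math.log(2.0 * n_neighbors / delta_m)
    return math.ceil(k_m)


# Hypothetical horizon-3 problem whose per-step rewards span 11 units: Lambda <= 2*3*11 = 66.
print(sample_bound(lam=66.0, eps=8.0, delta=0.1, n_neighbors=6, m=1))
```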

19 Background: MCESP+PAC We can transform early by modifying ε:
ε̄(m, p, q) = ε  if p = q = k_m
ε̄(m, p, q) = Λ(π, π') √( ln(2 k_m (N + 1) / δ_m) / (2p) ) − ε/2  if p = q < k_m
Terminate when k_m samples of each neighbor have been taken, or when for all neighbor policies:
Q_π^{ō,a} < Q_π^{ō,π(ō)} + ε − ε̄(m, c_π^{ō,a}, c_π^{ō,π(ō)})

20 Background: MCESP+PAC Then, with probability 1 − δ 1. MCESP+PAC picks transformations that are always better than the current policy 2. MCESP+PAC terminates with a policy that is an ε-local optimum That is, no neighbor is better than the final policy by more than ε

21 MCES-P for Multiagent Settings MCES-P can almost be used as-is in the multiagent setting Observations: public observations noisily indicate the physical state; private observations noisily indicate other agents' actions MCES-P has high computational costs: a large neighborhood requiring k_m samples each MCES for I-POMDPs (MCES-IP): explicitly models the opponent and significantly decreases sample requirements

22 MCES-IP Template MCES-P vs. MCES-IP
MCES-P simulation and Q-update: pick a random ō and a; simulate π(ō → a), generating trajectory τ; update Q_π^{ō,a} with R_post-ō(τ)
MCES-IP reasons about which actions the opponent took in the simulation before updating: pick a random ō and a; simulate π(ō → a), generating τ; update the belief over opponent models; calculate â_j from the most likely models; update Q_π^{ō,a,â_j} with R_post-ō(τ)
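
A side-by-side sketch of the two updates just listed. The trajectory simulator, the opponent-inference hook, and the simple sample-average learning rate are assumed placeholders; the point is only that MCES-IP keys its Q-table on the inferred opponent action sequence â_j.

```python
def alpha(c: int) -> float:
    """Assumed learning-rate schedule: a plain sample average."""
    return 1.0 / (c + 1)


def mcesp_update(Q, counts, obar, a, ret):
    """MCES-P: update Q for the transformation (obar, a) toward the sampled return."""
    key = (obar, a)
    c = counts.get(key, 0)
    Q[key] = (1 - alpha(c)) * Q.get(key, 0.0) + alpha(c) * ret
    counts[key] = c + 1


def mcesip_update(Q, counts, obar, a, tau, ret, infer_opponent):
    """MCES-IP: first infer the opponent's most likely actions, then update."""
    a_hat = tuple(infer_opponent(tau))     # most probable opponent action sequence for tau
    key = (obar, a, a_hat)                 # Q-values are predicated on a_hat
    c = counts.get(key, 0)
    Q[key] = (1 - alpha(c)) * Q.get(key, 0.0) + alpha(c) * ret
    counts[key] = c + 1
```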

23 MCES-IP Template Models MCES-IP maintains a set of models of the opponent, where a model = <history, policy tree> [Figure: three candidate models m1, m2, m3, each a policy tree over observations o1, o2 with actions a1, a2, a3]

24 MCES-IP Template Generating â_j Every round, MCES-IP updates the most probable model and selects the most probable action [Figure: belief over opponent models m1, m2, m3 at t = 1, 2, 3]

25 MCES-IP Template Generating â_j [Figure: t = 1 — given o_j = 2, the belief over m1, m2, m3 is updated and the most probable opponent action is a_j^0 = 2]

26 MCES-IP Template Generating â_j [Figure: t = 2 — given o_j = 1, the belief is updated again and a_j^1 = 1]

27 MCES-IP Template Generating â_j [Figure: t = 3 — given o_j = 1, the belief is updated and a_j^2 = 3, giving â_j = {2, 1, 3}]
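
A hedged sketch of how â_j can be produced round by round, as slides 24-27 illustrate: maintain a belief over candidate opponent models, reweight it by how well each model explains the agent's private observation, and read off the most probable model's action. The likelihood function and the model interface are assumptions, not the paper's exact machinery.

```python
from typing import Callable, List


def infer_opponent_actions(
    models: List[Callable[[int], int]],       # model: time step -> prescribed opponent action
    prior: List[float],                        # initial belief over the candidate models
    private_obs: List[int],                    # agent i's private observation at each step
    likelihood: Callable[[int, int], float],   # P(private observation | opponent action)
) -> List[int]:
    belief = list(prior)
    a_hat = []
    for t, obs in enumerate(private_obs):
        # Bayes step: weight each model by how well its prescribed action explains obs.
        weights = [b * likelihood(obs, m(t)) for b, m in zip(belief, models)]
        total = sum(weights) or 1e-12
        belief = [w / total for w in weights]
        # Read off the action prescribed by the currently most probable model.
        best = max(range(len(models)), key=lambda i: belief[i])
        a_hat.append(models[best](t))
    return a_hat
```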

28 MCES-IP Template Updating Q-values Update counts and Q-values using â_j:
Q_π^{ō,a,â_j} ← (1 − α(m, c_π^{ō,a,â_j})) Q_π^{ō,a,â_j} + α(m, c_π^{ō,a,â_j}) R_post-ō(τ)
So far, MCES-IP is more expensive than MCES-P: the Q-table is now up to |A_j|^T times larger (one entry per inferred opponent action sequence)!

29 MCESIP+PAC PAC Bounds MCESIP+PAC has similar PAC bounds to MCESP+PAC:
k_m = ⌈ (2 Λ^â_j(π)² / ε²) ln(2N / δ_m) ⌉
ε̄^â_j(m, p, q) = ε  if p = q = k_m
ε̄^â_j(m, p, q) = Λ^â_j(π, π') √( ln(2 k_m (N + 1) / δ_m) / (2p) ) − ε/2  if p = q < k_m

30 MCESIP+PAC PAC Bounds Λ^â_j modifies the range of possible rewards Since the opponent's action is known, the range of possible rewards may often be narrower [Figure: the reward ranges under particular inferred actions â_j are subsets of the range over all a_j], resulting in the following proposition: Λ^â_j(π, π') ≤ Λ(π, π')

31 MCESIP+PAC PAC Bounds MCESIP+PAC terminates when k_m samples of the local neighborhood bear no better policy, or when for all neighbors π':
Q_π' < Q_π + ε − ε̄^â_j(m, c_π', c_π)
With probability 1 − δ 1. MCESIP+PAC picks transformations that are always better than the current policy 2. MCESIP+PAC terminates with a policy that is an ε-local optimum

32 Policy Search Space Pruning

33 Policy Search Space Pruning Introduction Not all observation sequences occur with the same probability Low-likelihood events are difficult to sample Pruning: avoid policy transformations that involve rare observation sequences, while considering the impact on reward Regret: the amount of expected value lost by not simulating those transformations
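
An illustrative sketch of the pruning idea under simple assumptions: score each candidate transformation by the regret of never exploring it (here, the probability of reaching its observation sequence times the reward swing it could cause), then drop the lowest-regret transformations first until an allowable-regret budget is used up. The exact regret bound and budget semantics in the paper may differ.

```python
from typing import Dict, List, Tuple

Transformation = Tuple[Tuple[str, ...], str]    # (observation sequence, replacement action)


def prune(transforms: List[Transformation],
          seq_prob: Dict[Tuple[str, ...], float],   # probability of reaching each sequence
          reward_range: float,                       # largest reward swing a change could cause
          allowable_regret: float) -> List[Transformation]:
    regret = {t: seq_prob[t[0]] * reward_range for t in transforms}
    given_up, kept = 0.0, []
    # Consider the cheapest-to-skip transformations for pruning first.
    for t in sorted(transforms, key=lambda t: regret[t]):
        if given_up + regret[t] <= allowable_regret:
            given_up += regret[t]       # prune t and accept its (small) regret
        else:
            kept.append(t)              # t stays in the search neighborhood
    return kept
```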

34 Policy Search Space Pruning Regret [Figure: a transformation at an observation sequence reached with probability 6% carries regret 6.6, while one at a sequence reached with probability 30% carries regret 33]

35 Policy Search Space Pruning Allowable regret [Figure: an allowable-regret threshold, swept from 100% down to 0%, determines which transformations remain allowed; lowering the threshold prunes progressively more of the neighborhood]


39 Experiments Domains (3 domains) Multiagent Tiger Problem 3x2 UAV Problem

40 Experiments Domains Money Laundering (ML) Problem: Placement, Layering, Integration (bank, offshore, casinos, insurance, shell companies, real estate)


42 Experiments Domain Parameters Opponent follows a fixed strategy Single: only one policy is ever used Mixed (non-stationary environment): randomly selects from 2 to 3 policies every new trajectory [Table: domain parameters (ε, δ, % regret, horizon); horizon = 3 for Multiagent Tiger, 3x2 UAV, and Money Laundering]

43 Experiments Comparative Results Right: 2 runs comparing MCESP+PAC and MCESIP+PAC Right- top: Mixed strategy opponent Right- middle: Single strategy opponent

44 Experiments Pruning Pruning is crucial to tractability

45 Concluding Remarks Model-free RL in multiagent settings Generalized from MCES-P MCES-IP models the opponent and is more sample-efficient when paired with PAC bounds Partially model-free Instantiated with PAC to provide ε-local optimality, with search-space pruning for improved scalability

46 Thank you! Q & A

47 Related Works Bayes-Adaptive POMDPs (Ross et al. 2007), extended to MPOMDPs (Amato and Oliehoek 2013): model-based RL IMCQ-Alt for Dec-POMDPs (Banerjee et al. 2013): quasi-model-based, with an intermediate calculation of model parameters; alternating, i.e. each agent must take turns Bayes-Adaptive I-POMDPs (Ng et al. 2012): model-based RL, with the physical state perfectly observable

48 Background: Decision Processes Decision problem: how to optimize behavior to maximize reward? Choose the action that has the best expected outcome [Figure: an agent selects an action according to its preferences and receives reward R(a)]

49 Background: Decision Processes [Figure: the agent's action now also affects a physical state, and the reward R(s, a) depends on both the state and the action]


51 Background: RL A popular class of model-free RL methods is temporal-difference learning Example: Q-learning:
Q(s, a) ← (1 − α) Q(s, a) + α [ r(s, a) + γ max_{a'} Q(s', a') ]
α: learning rate γ: discount factor Computes action-values from a state by exploring new values and exploiting previous knowledge
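
For completeness, a minimal tabular version of the update above; the states, actions, and learning parameters are placeholders.

```python
from collections import defaultdict


def q_learning_update(Q, s, a, r, s_next, actions, alpha=0.1, gamma=0.95):
    """One Q-learning step: move Q(s, a) toward r + gamma * max_a' Q(s', a')."""
    best_next = max(Q[(s_next, a2)] for a2 in actions)
    Q[(s, a)] = (1 - alpha) * Q[(s, a)] + alpha * (r + gamma * best_next)


Q = defaultdict(float)                          # unseen (state, action) pairs default to 0
q_learning_update(Q, s="s0", a="listen", r=-1.0, s_next="s1", actions=["listen", "open"])
print(Q[("s0", "listen")])                      # -0.1
```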
