Reinforcement Learning in Partially Observable Multiagent Settings: Monte Carlo Exploring Policies
Presenter: Roi Ceren, THINC Lab, University of Georgia, roi@ceren.net
Prashant Doshi, THINC Lab, University of Georgia, pdoshi@cs.uga.edu
Bikramjit Banerjee, University of Southern Mississippi, bikramjit.banerjee@usm.edu
Introduction
Model-free reinforcement learning in multiagent systems is a nascent field.
Monte Carlo Exploring Starts for POMDPs (MCES-P) is a powerful single-agent RL technique: policy iteration that leverages Q-learning to hill-climb through the local policy space to a local optimum.
It allows PAC bounds to select the sample complexity with a desired confidence.
Introduction
We extend MCES-P to the non-cooperative multiagent setting and introduce MCES for Interactive POMDPs (MCES-IP), which explicitly models the opponent and predicates action-values on expected opponent behavior.
When instantiated with PAC bounds, MCES-IP trades off the computational expense of modeling against a lower sample complexity bound.
We additionally provide a policy search space pruning mechanism to promote scalability: it parametrically bounds the regret from avoiding policies and prioritizes eliminating low-regret policy transformations.
Background: Multiagent Decision Process
In the multiagent setting, all agents affect the state and the reward for each agent.
[Figure: agents i and j each act on the physical state; the joint action yields each agent an individual reward R(s, a_i, a_j).]
Background: I-POMDP
The Interactive POMDP (I-POMDP) (Gmytrasiewicz and Doshi 2005) is the tuple ⟨IS, A, T, Ω, O, R⟩.
Non-cooperative: agents get individual, potentially competitive rewards.
Actions A, state transitions T, observations Ω, observation probabilities O, and rewards R.
IS: the interactive state, combining the physical state and a model of the other agent.
Significant uncertainty: the agent must reason not only about the physical state but also about the opponent's motivations and beliefs.
Background: MCES-P Template
Monte Carlo Exploring Starts for POMDPs (MCES-P) (Perkins, AAAI 2002) is a general template:
Explore the neighborhood of π: all policies that differ by a single action a on some observation sequence $\vec{o}$.
Compute expected values by simulating policies online.
Hill-climb to policies with better values.
Terminate if no neighbor is better than the current policy.
Background: MCES-P Template
Transformation
Pick a random observation sequence and replace its action with a random alternative, e.g. {o1, o2}: a1 → a3.
[Figure: two depth-two policy trees with root action a2; the transformed policy π' differs from π only in taking a3 instead of a1 after the sequence {o1, o2}.]
Applying every such single swap, e.g. o1: a1 ↔ a2, {o1, o2}: a1 ↔ a2, o1: a1 ↔ a3, {o1, o2}: a1 ↔ a3, and at the root a2 ↔ a1 and a2 ↔ a3, yields the local neighborhood of π.
[Figure: π surrounded by its local neighborhood of single-transformation policies.]
Background: MCES-P Template
Sampling
Pick a random transformation, e.g. a3, and simulate it, updating the action-value with the post-sequence reward $R_{post}(\tau)$ of the sampled trajectory τ:
$Q_\pi^{(\vec{o},a)} \leftarrow \left(1 - \alpha(m, c_\pi^{(\vec{o},a)})\right) Q_\pi^{(\vec{o},a)} + \alpha(m, c_\pi^{(\vec{o},a)})\, R_{post}(\tau)$
where m indexes the current transformation and $c_\pi^{(\vec{o},a)}$ counts the samples taken of this neighbor.
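A sketch of this Monte Carlo backup in Python; the dict-based Q-table, the `alpha` schedule, and the way `post_reward` is obtained are assumptions for illustration:

```python
def update_q(Q, counts, key, post_reward, alpha):
    """One MCES-P backup for transformation `key` = (obs_sequence, action),
    mixing the old estimate with the trajectory's post-sequence reward."""
    counts[key] = counts.get(key, 0) + 1
    a = alpha(counts[key])                       # e.g. 1/c gives a running mean
    Q[key] = (1 - a) * Q.get(key, 0.0) + a * post_reward

Q, counts = {}, {}
update_q(Q, counts, key=(('o1', 'o2'), 'a3'), post_reward=4.0,
         alpha=lambda c: 1.0 / c)
```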
Background: MCES-P Template
Sampling
Sample the neighborhood k times for each policy; transform to a neighbor π' whenever $Q_{\pi'} > Q_\pi + \epsilon$.
Background: MCES-P Template
Termination
Terminate when all neighbors have been sampled k times and no neighbor is better.
Background: MCESP+PAC
Problem: choosing a good sample bound k.
Low values of k increase the chance that we make errors when transforming; high values, though requiring more samples, guarantee that we hill-climb correctly.
[Figure: spectrum from low k (high error probability, inaccurate Q-values) to high k (low error probability, accurate Q-values).]
Background: MCESP+PAC
Solution: pick a k that guarantees some confidence in the accuracy of the Q-value.
Probably Approximately Correct (PAC) learning: the probability of the sample average deviating from the true mean by more than ε is bounded by the error δ, where Λ is the range of the sampled values:
$\Pr\left(|\bar{X} - \mu| > \epsilon\right) \le 2\exp\left(-2k\epsilon^2/\Lambda^2\right) = \delta$
Background: MCESP+PAC
With ε and δ, we calculate the number of samples required to satisfy the error bound, where m is the number of transformations so far and N is the number of neighbor policies:
$\delta_m = \frac{6\delta}{\pi^2 m^2}, \qquad k_m = \left\lceil \frac{2\Lambda(\pi)^2}{\epsilon^2} \ln\frac{2N}{\delta_m} \right\rceil$
$\Lambda(\pi, \pi') = \max(Q_\pi - Q_{\pi'}) - \min(Q_\pi - Q_{\pi'}) \le 2T(R_{max} - R_{min}), \qquad \Lambda(\pi) = \max_{\pi' \in neighbors(\pi)} \Lambda(\pi, \pi')$
Background: MCESP+PAC
We can transform early by modifying ε:
$\bar{\epsilon}(m, p, q) = \begin{cases} \epsilon/2 & \text{if } p = q = k_m \\ \Lambda(\pi, \pi')\sqrt{\frac{1}{2p}\ln\frac{2(k_m - 1)N}{\delta_m}} & \text{if } p = q < k_m \end{cases}$
Terminate when $k_m$ samples of each neighbor have been taken, or when for all neighbor policies:
$Q_\pi^{(\vec{o},a)} < Q_\pi^{(\vec{o},\pi(\vec{o}))} + \epsilon - \bar{\epsilon}\left(m, c_\pi^{(\vec{o},a)}, c_\pi^{(\vec{o},\pi(\vec{o}))}\right)$
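These quantities translate directly into code. A minimal sketch under the reconstruction given on this and the previous slide; the exact constants (the $6\delta/\pi^2 m^2$ budget, the $k_m - 1$ factor) follow that reconstruction and should be treated as assumptions:

```python
import math

def pac_schedule(m, N, eps, delta, Lam):
    """Per-transformation error budget delta_m, sample bound k_m, and the
    early-decision threshold eps_bar, per the Hoeffding-style bound above."""
    delta_m = 6 * delta / (math.pi ** 2 * m ** 2)
    k_m = math.ceil(2 * Lam ** 2 / eps ** 2 * math.log(2 * N / delta_m))

    def eps_bar(p):
        # Evaluated when both policies being compared have p samples each.
        if p == k_m:
            return eps / 2
        return Lam * math.sqrt(math.log(2 * (k_m - 1) * N / delta_m) / (2 * p))

    return delta_m, k_m, eps_bar

# e.g. the Tiger-domain settings from the experiments slide (Lam illustrative):
delta_m, k_m, eps_bar = pac_schedule(m=1, N=6, eps=0.05, delta=0.1, Lam=1.0)
```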
Background: MCESP+PAC
Then, with probability 1 − δ:
1. MCESP+PAC picks transformations that are always better than the current policy.
2. MCESP+PAC terminates with a policy that is an ε-local optimum, i.e., no neighbor is better than the final policy by more than ε.
MCES-P for Multiagent Settings
MCES-P can almost be used as-is in the multiagent setting.
Observations: public (noisily indicate the physical state) and private (noisily indicate other agents' actions).
However, MCES-P has high computational costs: a large neighborhood requiring $k_m$ samples each.
MCES for I-POMDPs: explicitly models the opponent and significantly decreases sample requirements.
MCES-IP Template
MCES-P vs. MCES-IP
MCES-P simulation and Q-update: pick a random $(\vec{o}, a)$; simulate $\pi(\vec{o} \to a)$, generating τ; update $Q_\pi^{(\vec{o},a)}$ with $R_{post}(\tau)$.
MCES-IP reasons about which actions the opponent took in the simulation prior to updating: pick a random $(\vec{o}, a)$; simulate $\pi(\vec{o} \to a)$, generating τ; update the belief over opponent models; calculate $\vec{a}_j$ from the most likely models; update $Q_\pi^{\vec{a}_j,(\vec{o},a)}$ with $R_{post}(\tau)$.
MCES-IP Template
Models
MCES-IP maintains a set of models of the opponent, where a model = ⟨history, policy tree⟩.
[Figure: three candidate opponent models m1, m2, m3, each a depth-two policy tree with a different root action (a1, a2, a3) and different branch actions.]
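A hypothetical container for such a model, reused by the MCES-IP update sketch a few slides below; the policy-tree representation and method names are assumptions, not the authors' data structures:

```python
class OpponentModel:
    """A candidate opponent model: a history plus a policy tree that maps
    the opponent's observation sequences (tuples) to its actions."""

    def __init__(self, history, policy_tree):
        self.history = history
        self.policy_tree = policy_tree

    def actions(self, opp_obs):
        """Actions the model prescribes along a run, given the opponent
        observations we attribute to it (root action first)."""
        return [self.policy_tree[tuple(opp_obs[:t])]
                for t in range(len(opp_obs) + 1)]

# A model like the slide's m1: root a1, then a1 after o1 and a2 after o2.
m1 = OpponentModel(history=(),
                   policy_tree={(): 'a1', ('o1',): 'a1', ('o2',): 'a2'})
print(m1.actions(('o1',)))  # ['a1', 'a1']
```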
MCES-IP Template
Generating $\vec{a}_j$
Every round, MCES-IP updates the most probable model and selects its most probable action.
[Figure: bar charts of the belief over models m1, m2, m3 at t = 1, 2, 3. As private observations arrive, the belief concentrates on a single model each round, yielding the inferred opponent actions $a_j^0 = 2$, $a_j^1 = 1$, $a_j^2 = 3$, i.e. $\vec{a}_j = \{2, 1, 3\}$.]
MCES-IP Template
Updating Q-values
Update counts and Q-values using $\vec{a}_j$:
$Q_\pi^{\vec{a}_j,(\vec{o},a)} \leftarrow \left(1 - \alpha(m, c_\pi^{\vec{a}_j,(\vec{o},a)})\right) Q_\pi^{\vec{a}_j,(\vec{o},a)} + \alpha(m, c_\pi^{\vec{a}_j,(\vec{o},a)})\, R_{post}(\tau)$
So far, MCES-IP is more expensive than MCES-P: the Q-table is now up to $|A_j|^T$ times larger!
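Putting the belief filtering and the $\vec{a}_j$-indexed backup together, a hypothetical sketch using the `OpponentModel` interface above; the `likelihood` callable (standing in for the probability of the learner's private observations under a model) is an invented placeholder, and the paper's actual update may differ in detail:

```python
def mces_ip_update(Q, counts, belief, models, likelihood, opp_obs, key,
                   post_reward, alpha):
    """One MCES-IP backup: filter the belief over opponent models, read off
    the most likely opponent action sequence a_j, then update the Q-value
    indexed by (a_j, key), where key = (obs_sequence, action)."""
    # 1. Bayes-update the belief over candidate opponent models.
    for m in models:
        belief[m] *= likelihood(m)
    total = sum(belief.values())
    for m in models:
        belief[m] /= total

    # 2. Predicted opponent actions a_j from the most probable model.
    best = max(models, key=lambda m: belief[m])
    a_j = tuple(best.actions(opp_obs))       # e.g. (2, 1, 3) as on the slides

    # 3. Monte Carlo backup; the Q-table is additionally keyed by a_j.
    k = (a_j, key)
    counts[k] = counts.get(k, 0) + 1
    a = alpha(counts[k])
    Q[k] = (1 - a) * Q.get(k, 0.0) + a * post_reward
```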
MCESIP+PAC
PAC Bounds
MCESIP+PAC has PAC bounds similar to MCESP+PAC's:
$k_m = \left\lceil \frac{2\,\Lambda^{\vec{a}_j}(\pi_i)^2}{\epsilon^2} \ln\frac{2N}{\delta_m} \right\rceil$
$\bar{\epsilon}^{\vec{a}_j}(m, p, q) = \begin{cases} \epsilon/2 & \text{if } p = q = k_m \\ \Lambda^{\vec{a}_j}(\pi_i, \pi_i')\sqrt{\frac{1}{2p}\ln\frac{2(k_m - 1)N}{\delta_m}} & \text{if } p = q < k_m \end{cases}$
MCESIP+PAC
PAC Bounds
$\Lambda^{\vec{a}_j}$ modifies the range of possible rewards: since the opponent action is known, the range of possible rewards may often be narrower.

         a_i^1   a_i^2
a_j^1      0       3
a_j^2      4       5

This results in the following proposition: $\Lambda^{\vec{a}_j}(\pi_i, \pi_i') \le \Lambda(\pi_i, \pi_i')$
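A tiny worked check of the proposition on the reward matrix above (the row/column labels follow the reconstruction here):

```python
# Reward matrix from the slide: rows = opponent action, columns = our action.
R = {('aj1', 'ai1'): 0, ('aj1', 'ai2'): 3,
     ('aj2', 'ai1'): 4, ('aj2', 'ai2'): 5}

def reward_range(R, aj=None):
    """Reward span overall, or restricted to a known opponent action."""
    vals = [v for (j, _), v in R.items() if aj is None or j == aj]
    return max(vals) - min(vals)

print(reward_range(R))         # 5: opponent action unknown
print(reward_range(R, 'aj1'))  # 3: knowing a_j narrows the range
print(reward_range(R, 'aj2'))  # 1
```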
MCESIP+PAC
PAC Bounds
MCESIP+PAC terminates when $k_m$ samples of the local neighborhood bear no better policy, or when for all neighbors π':
$Q_{\pi'} < Q_\pi + \epsilon - \bar{\epsilon}\left(m, c_\pi^{(\vec{o},a)}, c_\pi^{(\vec{o},\pi(\vec{o}))}\right)$
With probability 1 − δ:
1. MCESIP+PAC picks transformations that are always better than the current policy.
2. MCESIP+PAC terminates with a policy that is an ε-local optimum.
Policy Search Space Pruning
Policy Search Space Pruning
Introduction
Not all observation sequences occur with the same probability, and low-likelihood events are difficult to sample.
Pruning: avoid policy transformations that involve rare observation sequences, while accounting for the impact on reward.
Regret: the amount of expected value lost by not simulating these transformations. (A sketch of the pruning rule follows the figures below.)
Policy Search Space Pruning
Regret
[Figure: transformations annotated with the probability of their observation sequences; a sequence with Pr ≈ 30% carries regret 33, while one with Pr ≈ 6% carries regret 6.6.]
Policy Search Space Pruning
[Figure: animation of the allowable-regret threshold between 100% and 0%; the set of allowed transformations changes with the threshold, and transformations whose regret falls below it are pruned.]
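As referenced above, a minimal sketch of the pruning rule. The per-transformation regret values mirror the regret figure, the `allowable_regret` fraction plays the role of the experiments' "% regret" parameter, and the dict representation is assumed:

```python
def prune(transformations, allowable_regret):
    """Keep only transformations whose regret (roughly, the probability of
    the observation sequence times the reward that could be forgone)
    meets the allowed fraction of the largest regret."""
    cutoff = allowable_regret * max(t['regret'] for t in transformations)
    return [t for t in transformations if t['regret'] >= cutoff]

ts = [{'seq': ('o1',), 'regret': 33.0},        # Pr ~ 30%: worth sampling
      {'seq': ('o1', 'o2'), 'regret': 6.6}]    # Pr ~ 6%: rarely observed
print(prune(ts, allowable_regret=0.3))         # drops the low-regret transform
```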
Experiments
Domains
Three domains: the Multiagent Tiger problem, the 3x2 UAV problem, and the Money Laundering (ML) problem, in which funds move through placement, layering, and integration stages (bank, offshore accounts, casinos, insurance, shell companies, real estate).
Experiments
Domain Parameters
The opponent follows a fixed strategy. Single: only one policy is ever used. Mixed (non-stationary environment): randomly selects from 2 to 3 policies on every new trajectory.

Domain             ε      δ      % regret   horizon
Multiagent Tiger   0.05   0.1    15%        3
3x2 UAV            0.1    0.1    20%        3
Money Laundering   0.1    0.15   20%        3
Experiments
Comparative Results
Right: two runs comparing MCESP+PAC and MCESIP+PAC. Right-top: mixed-strategy opponent. Right-middle: single-strategy opponent.
[Figure: comparative learning curves.]
Experiments
Pruning
Pruning is crucial to tractability.
[Figure: effect of pruning across the three domains; reported values 7.59, 5.94, and 8.37.]
Concluding Remarks
Model-free RL in multiagent settings, generalized from MCES-P.
MCES-IP models the opponent and is more sample-efficient when paired with PAC bounds; it is partially model-free.
Instantiated with PAC bounds, it provides ε-local optimality, and search space pruning improves scalability.
Thank you! Q & A
Related Works
Bayes-Adaptive POMDPs (Ross et al. 2007), extended to MPOMDPs (Amato and Oliehoek 2013): model-based RL.
MCQ-Alt for Dec-POMDPs (Banerjee et al. 2013): quasi-model-based, with intermediate calculation of model parameters; alternating, i.e., agents must take turns.
Bayes-Adaptive I-POMDPs (Ng et al. 2012): model-based RL; the physical state is perfectly observable.
Background: Decision Processes
Decision problem: how should behavior be optimized to maximize reward? Choose the action that has the best expected outcome.
[Figure: an agent with preferences selects an action and receives reward R(a).]
Background: Decision Processes
[Figure: the agent now acts on a physical state; its preferences determine the action, and it receives reward R(s, a).]
Background: RL
A popular class of model-free RL methods is temporal difference learning. Example: Q-learning:
$Q(s, a) \leftarrow (1 - \alpha)\, Q(s, a) + \alpha\left[ r(s, a) + \gamma \max_{a'} Q(s', a') \right]$
α: learning rate. γ: discount factor.
Q-learning computes action-values from a state by exploring new values and exploiting previous knowledge.
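For concreteness, a textbook tabular Q-learning step matching the update above, with ε-greedy action selection standing in for the explore/exploit trade-off; all names and defaults are illustrative:

```python
import random

def q_step(Q, s, a, r, s_next, actions, alpha=0.1, gamma=0.95):
    """One Q-learning backup toward the target r + gamma * max_a' Q(s', a')."""
    target = r + gamma * max(Q.get((s_next, b), 0.0) for b in actions)
    Q[(s, a)] = (1 - alpha) * Q.get((s, a), 0.0) + alpha * target

def epsilon_greedy(Q, s, actions, eps=0.1):
    """Explore a random action with probability eps, otherwise exploit."""
    if random.random() < eps:
        return random.choice(actions)
    return max(actions, key=lambda b: Q.get((s, b), 0.0))
```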