Reinforcement Learning in Partially Observable Multiagent Settings: Monte Carlo Exploring Policies


Reinforcement Learning in Partially Observable Multiagent Settings: Monte Carlo Exploring Policies. Presenter: Roi Ceren, THINC Lab, University of Georgia, roi@ceren.net; Prashant Doshi, THINC Lab, University of Georgia, pdoshi@cs.uga.edu; Bikramjit Banerjee, University of Southern Mississippi, bikramjit.banerjee@usm.edu

Introduction Model-free reinforcement learning in multiagent systems is a nascent field. Monte Carlo Exploring Starts for POMDPs (MCES-P) is a powerful single-agent RL technique: policy iteration that leverages Q-learning to hill-climb through the local policy space to a local optimum. It allows PAC bounds to select the sample complexity with a given confidence.

Introduction We extend MCES-P to the non-cooperative multiagent setting and introduce MCES for Interactive POMDPs (MCES-IP), which explicitly models the opponent and predicates action-values on expected opponent behavior. When instantiated with PAC bounds, it trades off the computational expense of modeling against a lower sample complexity bound. We additionally provide a policy space pruning mechanism to promote scalability: it parametrically bounds the regret from avoiding policies and prioritizes eliminating low-regret policy transformations.

Background: Multiagent Decision Process In the multiagent setting, all agents affect the physical state and the reward received by each agent. [Figure: agents i and j each choose an action; the joint action drives the physical state and yields individual rewards R(s, a_i, a_j).]

Background: I-POMDP The Interactive POMDP (I-POMDP) (Gmytrasiewicz and Doshi 2005) is defined by the tuple <IS, A, T, Ω, O, R>. Non-cooperative: agents receive individual, potentially competitive rewards. Actions A, state transitions T, observations Ω, observation probabilities O, and rewards R. IS: the interactive state, combining the physical state with a model of the other agent. Significant uncertainty: the agent must reason not only about the physical state, but also about the opponent's motivations and beliefs.
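To make the interactive-state idea concrete, here is a minimal sketch in Python; the class and field names (OpponentModel, InteractiveState, physical_state, etc.) are illustrative assumptions, not definitions from the talk or the I-POMDP papers.

```python
# Minimal sketch: an interactive state pairs the physical state with a model
# of the other agent (here, the opponent's observation history and a policy
# mapping observation sequences to actions). All names are hypothetical.
from dataclasses import dataclass
from typing import Any, Dict, Tuple


@dataclass
class OpponentModel:
    history: Tuple[Any, ...]             # opponent's observation history so far
    policy: Dict[Tuple[Any, ...], Any]   # observation sequence -> action


@dataclass
class InteractiveState:
    physical_state: Any                  # e.g., "tiger-left" in the Tiger problem
    opponent_model: OpponentModel        # agent i's model of agent j
```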

Background: MCES-P Template Monte Carlo Exploring Starts for POMDPs (MCES-P) (Perkins, AAAI 2002). General template: explore the neighborhood of π, i.e., all policies that differ by a single action a on some observation sequence o̅; compute expected values by simulating policies online; hill-climb to policies with better values; terminate if no neighbor is better than the current policy. A sketch of this loop is given below.
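The following is a simplified sketch of the template just described, not the authors' implementation: it samples every neighbor a fixed k times per iteration rather than using exploring starts, and assumes hypothetical helpers `neighbors(pi)` (enumerates all single-transformation neighbors) and `simulate(pi)` (returns one rollout's return). Policies are assumed hashable (e.g., tuples).

```python
# Hedged sketch of the MCES-P hill-climbing template (simplified).
from collections import defaultdict


def mces_p(pi, neighbors, simulate, k, epsilon):
    """Hill-climb through policy space until no neighbor looks epsilon-better."""
    while True:
        q = defaultdict(float)       # running-average Q estimate per candidate
        counts = defaultdict(int)    # samples taken per candidate
        candidates = [pi] + list(neighbors(pi))
        for _ in range(k):
            for cand in candidates:
                r = simulate(cand)                        # Monte Carlo rollout
                counts[cand] += 1
                q[cand] += (r - q[cand]) / counts[cand]   # incremental mean
        best = max(candidates[1:], key=lambda c: q[c])
        if q[best] > q[pi] + epsilon:
            pi = best                 # transform: move to the better neighbor
        else:
            return pi                 # terminate: epsilon-local optimum
```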

Background: MCES-P Template Transformation Pick a random observation sequence and replace its action with a random alternative, e.g., {o1,o2}: a1 → a3. [Figure: the original policy tree and the transformed policy tree, which differ only in the action taken after observation sequence {o1,o2}.]

Background: MCES-P Template Transformation Pick a random observation sequence and replace its action with a random alternative: π and π′ differ by the transformation {o1,o2}: a1 ↔ a3.

Background: MCES-P Template Transformation [Figure: π and its neighbor policies π′, each reached by a single transformation such as o1: a1 ↔ a2, o1: a1 ↔ a3, {o1,o2}: a1 ↔ a2, or {o1,o2}: a1 ↔ a3.]

Background: MCES-P Template Transformation These transformations define the local neighborhood of π.

Background: MCES-P Template Sampling Pick a random action (e.g., a3) and simulate it, then update the action-value with the trajectory's return:

$Q_{\pi \leftarrow (\bar{o},a)} \leftarrow \big(1 - \alpha(m, c_{\bar{o},a})\big)\, Q_{\pi \leftarrow (\bar{o},a)} + \alpha(m, c_{\bar{o},a})\, R_{\mathrm{post}\,\bar{o}}(\tau)$
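A small sketch of this count-based update, assuming the simplest learning-rate schedule α(m, c) = 1/c (a plain sample average); the paper's actual schedule may differ, and the key layout is hypothetical.

```python
# Incremental action-value update: Q <- (1 - alpha) * Q + alpha * R_post(tau).
def update_q(q_table, counts, key, post_reward):
    counts[key] = counts.get(key, 0) + 1
    alpha = 1.0 / counts[key]               # assumed schedule alpha = 1/count
    q_table[key] = (1.0 - alpha) * q_table.get(key, 0.0) + alpha * post_reward
    return q_table[key]


# Usage: key identifies the transformed policy, e.g., (observation sequence, action).
q, c = {}, {}
update_q(q, c, (("o1", "o2"), "a3"), post_reward=4.5)
```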

Background: MCES-P Template Sampling Sample the neighborhood k times for each policy; transform to a neighbor π′ as soon as $Q_{\pi'} > Q_{\pi} + \epsilon$.

Background: MCES-P Template Termination Terminate when all neighbors have been sampled k times and no neighbor is better than the current policy.

Background: MCESP+PAC Problem: choosing a good sample bound k. Low values of k increase the chance that we make errors when transforming; high values, though requiring more samples, guarantee that we hill-climb correctly. [Figure: spectrum from high error probability with inaccurate Q-values (small k) to low error probability with accurate Q-values (large k).]

Background: MCESP+PAC Solution: pick a k that guarantees some confidence in the accuracy of the Q-value. Probably Approximately Correct (PAC) Learning: the probability of the sample average deviating from the true mean by more than $\epsilon$ is bounded by the error $\delta$:

$\Pr\big(|\bar{X} - \mu| > \epsilon\big) \le 2\exp\!\left(-\frac{2k\epsilon^2}{\Lambda^2}\right) = \delta$

Background: MCESP+PAC With $\epsilon$ and $\delta$, we calculate the number of samples required to satisfy the error bound, where m indexes the current transformation and N is the number of neighbor policies:

$\delta_m = \frac{6\delta}{\pi^2 m^2}, \qquad k_m = \frac{2\,\Lambda(\pi)^2}{\epsilon^2} \ln\frac{2N}{\delta_m}$

$\Lambda(\pi, \pi') = \max\big(Q_\pi - Q_{\pi'}\big) - \min\big(Q_\pi - Q_{\pi'}\big) \le 2T\big(R_{\max} - R_{\min}\big), \qquad \Lambda(\pi) = \max_{\pi' \in \mathcal{N}(\pi)} \Lambda(\pi, \pi')$
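A sketch of the sample-bound computation, following the formulas as reconstructed above (treat the exact constants, especially the $6\delta/\pi^2 m^2$ allocation, as my reading of the garbled slide rather than a quotation).

```python
# Per-transformation error allocation and PAC sample bound (reconstruction).
import math


def delta_m(delta, m):
    """Error budget allotted to the m-th transformation."""
    return 6.0 * delta / (math.pi ** 2 * m ** 2)


def sample_bound(lam, eps, n_neighbors, delta, m):
    """k_m = ceil( (2 * Lambda^2 / eps^2) * ln(2N / delta_m) )."""
    d_m = delta_m(delta, m)
    return math.ceil(2.0 * lam ** 2 / eps ** 2 * math.log(2.0 * n_neighbors / d_m))


# Example: Lambda bounded by 2*T*(Rmax - Rmin) with T = 3 and a reward span of 10.
print(sample_bound(lam=60.0, eps=0.1, n_neighbors=50, delta=0.1, m=1))
```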

Background: MCESP+PAC We can transform early by modifying $\epsilon$:

$\bar{\epsilon}(m, p, q) = \begin{cases} \epsilon/2 & \text{if } p = q = k_m \\[4pt] \Lambda(\pi, \pi')\sqrt{\dfrac{1}{2p}\ln\dfrac{2(k_m - 1)N}{\delta_m}} & \text{otherwise, if } p = q < k_m \end{cases}$

Terminate when $k_m$ samples of each neighbor have been taken, or when, for every neighbor $\pi' = \pi \leftarrow (\bar{o}, a)$:

$Q_{\pi'} < Q_{\pi} + \epsilon - \bar{\epsilon}\big(m, c_{\pi'}, c_{\pi}\big)$
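The early-stopping tolerance and termination test above, as a sketch; the case structure and the $(k_m - 1)$ factor follow my reconstruction of the slide, so take them as assumptions.

```python
# Early-transformation tolerance eps_bar and the per-neighbor termination test.
import math


def eps_bar(eps, lam_pair, p, q, k_m, n_neighbors, d_m):
    """p, q: sample counts for the neighbor and the current policy."""
    if p == q == k_m:
        return eps / 2.0
    return lam_pair * math.sqrt(
        math.log(2.0 * (k_m - 1) * n_neighbors / d_m) / (2.0 * p))


def neighbor_ruled_out(q_neighbor, q_current, eps, eb):
    """True when Q_pi' < Q_pi + eps - eps_bar, i.e., the neighbor can be dismissed."""
    return q_neighbor < q_current + eps - eb
```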

Background: MCESP+PAC Then, with probability $1 - \delta$: 1. MCESP+PAC picks transformations that are always better than the current policy. 2. MCESP+PAC terminates with a policy that is an $\epsilon$-local optimum; that is, no neighbor is better than the final policy by more than $\epsilon$.

MCES-P for Multiagent Settings MCES-P can almost be used as-is in the multiagent setting. Observations: public observations noisily indicate the physical state; private observations noisily indicate the other agents' actions. However, MCES-P has high computational costs: a large neighborhood requiring $k_m$ samples each. MCES for I-POMDPs (MCES-IP) explicitly models the opponent and significantly decreases the sample requirements.

MCES-IP Template MCES-P vs. MCES-IP. MCES-P simulation and Q-update: pick a random $\bar{o}$ and $a$; simulate $\pi \leftarrow (\bar{o}, a)$, generating trajectory $\tau$; update $Q_{\pi \leftarrow (\bar{o},a)}$ with $R_{\mathrm{post}\,\bar{o}}(\tau)$. MCES-IP additionally reasons about which actions the opponent took in the simulation prior to updating: pick a random $\bar{o}$ and $a$; simulate $\pi \leftarrow (\bar{o}, a)$, generating $\tau$; update the belief over opponent models; calculate $a_j$ from the most likely models; update $Q^{a_j}_{\pi \leftarrow (\bar{o},a)}$ with $R_{\mathrm{post}\,\bar{o}}(\tau)$.

MCES-IP Template Models MCES-IP maintains a set of models of the opponent, where a model = <history, policy tree>. [Figure: three candidate opponent models m1, m2, m3, each a depth-two policy tree over actions a1, a2, a3 and observations o1, o2.]
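A sketch of one way to represent such a policy-tree model and read off its predicted action; the nested-dict layout and the `model_action` helper are illustrative choices, not the paper's data structures.

```python
# Opponent model = <history, policy tree>; the policy tree is a nested dict:
# {"action": ..., "children": {observation: subtree}}.
def model_action(policy_tree, history):
    """Walk the tree along the opponent's observation history and return the
    action the model predicts at the current step."""
    node = policy_tree
    for obs in history:
        node = node["children"][obs]
    return node["action"]


m1 = {"action": "a1",
      "children": {"o1": {"action": "a1", "children": {}},
                   "o2": {"action": "a1", "children": {}}}}
print(model_action(m1, ("o2",)))   # -> "a1"
```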

MCES-IP Template Generating $a_j$ Every round, MCES-IP updates its belief over the opponent's models and selects the most probable action of the most probable model. [Figure: belief over models m1, m2, m3 at t = 1, 2, 3; after each opponent observation the belief concentrates on one model, yielding the inferred actions $a_j^0 = 2$, $a_j^1 = 1$, $a_j^2 = 3$, i.e., $a_j = \{2, 1, 3\}$.]
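A sketch of this per-round inference: reweight the belief over candidate models by how well each explains the latest evidence, then take the most probable model's predicted action as $a_j$. The `likelihood` function and the model layout are assumptions for illustration.

```python
# Belief update over opponent models and inference of the opponent action a_j.
def update_belief(belief, models, evidence, likelihood):
    weights = {m_id: belief[m_id] * likelihood(models[m_id], evidence)
               for m_id in models}
    total = sum(weights.values()) or 1.0        # guard against all-zero weights
    return {m_id: w / total for m_id, w in weights.items()}


def infer_opponent_action(belief, models, history_j):
    best = max(belief, key=belief.get)                    # most probable model
    return models[best]["policy"][tuple(history_j)]       # its predicted a_j


models = {"m1": {"policy": {(): "a2", ("o1",): "a1"}},
          "m2": {"policy": {(): "a3", ("o1",): "a2"}}}
belief = {"m1": 0.7, "m2": 0.3}
belief = update_belief(belief, models, "o_j",
                       likelihood=lambda m, e: 0.9 if m["policy"][()] == "a3" else 0.2)
print(infer_opponent_action(belief, models, []))          # -> "a3"
```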

MCES-IP Template Updating Q-values Update counts and Q-values indexed by $a_j$:

$Q^{a_j}_{\pi \leftarrow (\bar{o},a)} \leftarrow \big(1 - \alpha(m, c^{a_j}_{\bar{o},a})\big)\, Q^{a_j}_{\pi \leftarrow (\bar{o},a)} + \alpha(m, c^{a_j}_{\bar{o},a})\, R_{\mathrm{post}\,\bar{o}}(\tau)$

So far, MCES-IP is more expensive than MCES-P: the Q-table is now up to a factor of $|A_j|$ larger!
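A sketch of the enlarged table: compared with the MCES-P update shown earlier, the key is simply extended by the inferred opponent action sequence, which is where the extra size comes from. The key layout and the $\alpha = 1/\text{count}$ schedule are assumptions.

```python
# MCES-IP Q-table keyed by (observation sequence, own action, inferred a_j).
q_table, counts = {}, {}


def update_q_ip(obs_seq, action_i, a_j_seq, post_reward):
    key = (tuple(obs_seq), action_i, tuple(a_j_seq))   # a_j extends the key
    counts[key] = counts.get(key, 0) + 1
    alpha = 1.0 / counts[key]
    q_table[key] = (1.0 - alpha) * q_table.get(key, 0.0) + alpha * post_reward


update_q_ip(["o1", "o2"], "a3", ["a2", "a1", "a3"], post_reward=2.0)
```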

MCESIP+PAC PAC Bounds MCESIP+PAC has PAC bounds similar to those of MCESP+PAC:

$k_m = \frac{2\,\Lambda^{a_j}(\pi)^2}{\epsilon^2}\ln\frac{2N}{\delta_m}$

$\bar{\epsilon}^{a_j}(m, p, q) = \begin{cases} \epsilon/2 & \text{if } p = q = k_m \\[4pt] \Lambda^{a_j}(\pi, \pi')\sqrt{\dfrac{1}{2p}\ln\dfrac{2(k_m - 1)N}{\delta_m}} & \text{otherwise, if } p = q < k_m \end{cases}$

MCESIP+PAC PAC Bounds $\Lambda^{a_j}$ modifies the range of possible rewards: since the opponent's action is known, the range of possible rewards may often be narrower. For example, with rewards

        a_i^1   a_i^2
a_j^1     0       3
a_j^2     4       5

knowing $a_j$ restricts the reward range (e.g., to [0, 3] or [4, 5]) instead of the full [0, 5], resulting in the following proposition: $\Lambda^{a_j}(\pi, \pi') \le \Lambda(\pi, \pi')$.
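A quick numerical check of that narrowing on the example table above (the column labels $a_i^1, a_i^2$ are my reading of the slide's table headers).

```python
# Reward range over all joint actions vs. the range once a_j is fixed.
rewards = {("a_j1", "a_i1"): 0, ("a_j1", "a_i2"): 3,
           ("a_j2", "a_i1"): 4, ("a_j2", "a_i2"): 5}

full_range = max(rewards.values()) - min(rewards.values())        # 5 - 0 = 5
given_aj1 = [r for (aj, _), r in rewards.items() if aj == "a_j1"]
range_aj1 = max(given_aj1) - min(given_aj1)                       # 3 - 0 = 3
print(full_range, range_aj1)   # 5 3 -> narrower range when a_j is known
```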

MCESIP+PAC PAC Bounds MCESIP+PAC terminates when $k_m$ samples of the local neighborhood yield no better policy, or when, for all neighbors $\pi'$:

$Q_{\pi'} < Q_{\pi} + \epsilon - \bar{\epsilon}\big(m, c_{\pi'}, c_{\pi}\big)$

With probability $1 - \delta$: 1. MCESIP+PAC picks transformations that are always better than the current policy. 2. MCESIP+PAC terminates with a policy that is an $\epsilon$-local optimum.

Policy Search Space Pruning

Policy Search Space Pruning Introduction Not all observation sequences occur with the same probability, and low-likelihood events are difficult to sample. Pruning: avoid policy transformations that involve rare observation sequences, while accounting for the impact on reward. Regret: the amount of expected value lost by not simulating these transformations.

Policy Search Space Pruning Regret [Figure: two candidate transformations rooted at observation sequences with occurrence probabilities Pr = 6% and Pr = 30%, incurring regret 6.6 and regret 33, respectively, if skipped; rarer sequences contribute less regret.]
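A sketch of regret-based pruning. The specific regret formula used here, Pr(observation sequence) times the largest possible reward difference, is my assumption; it reproduces the slide's 6.6 and 33 figures if that reward difference is 110, but the paper's bound may be defined differently.

```python
# Regret-bounded pruning: drop transformations whose regret from being skipped
# stays within the allowable-regret threshold.
def transformation_regret(prob_obs_seq, reward_range):
    return prob_obs_seq * reward_range


def prune(transformations, allowable_regret):
    """Keep only transformations whose potential regret exceeds the bound;
    low-regret (rare-sequence) transformations are skipped."""
    return [t for t in transformations
            if transformation_regret(t["prob"], t["reward_range"]) > allowable_regret]


cands = [{"name": "t1", "prob": 0.06, "reward_range": 110.0},   # regret 6.6
         {"name": "t2", "prob": 0.30, "reward_range": 110.0}]   # regret 33
print([t["name"] for t in prune(cands, allowable_regret=10.0)])  # -> ['t2']
```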

Policy Search Space Pruning [Figure: the allowable-regret threshold, varied between 100% and 0%, determines which transformations remain allowed; transformations whose regret falls within the allowable bound are pruned.]

Experiments Domains Three domains: the Multiagent Tiger problem, the 3x2 UAV problem, and the Money Laundering problem (next slide).

Experiments Domains Money Laundering (ML) Problem. [Figure: the three stages Placement, Layering, and Integration, with assets moving through bank, offshore, casinos, insurance, shell companies, and real estate.]

Experiments Domain Parameters The opponent follows a fixed strategy. Single: only one policy is ever used. Mixed (non-stationary environment): randomly selects from 2 to 3 policies for each new trajectory.

Domain             ε     δ     % regret   horizon
Multiagent Tiger   0.05  0.1   15%        3
3x2 UAV            0.1   0.1   20%        3
Money Laundering   0.1   0.15  20%        3

Experiments Comparative Results Right: two runs comparing MCESP+PAC and MCESIP+PAC. Right-top: mixed-strategy opponent. Right-middle: single-strategy opponent.

Experiments Pruning Pruning is crucial to tractability. [Figure: bar chart of results with data labels 7.59, 5.94, and 8.37.]

Concluding Remarks Model-free RL in multiagent settings, generalized from MCES-P. MCES-IP models the opponent and is more sample-efficient when paired with PAC bounds, though it is only partially model-free. Instantiated with PAC bounds, it provides ε-local optimality, and search-space pruning improves scalability.

Thank you! Q & A

Related Works Bayes-Adaptive POMDPs (Ross et al. 2007), extended to MPOMDPs (Amato and Oliehoek 2013): model-based RL. IMCQ-Alt for Dec-POMDPs (Banerjee et al. 2013): quasi-model-based (intermediate calculation of model parameters); alternating, i.e., each agent must take turns. Bayes-Adaptive I-POMDPs (Ng et al. 2012): model-based RL; the physical state is perfectly observable.

Background: Decision Processes Decision problem: how do we optimize behavior to maximize reward? Choose the action with the best expected outcome. [Figure: an agent with preferences selects an action and receives reward R(a).]

Background: Decision Processes [Figure: adding a physical state; the agent selects an action based on the state and its preferences, and receives reward R(s, a).]

Background: RL A popular class of model-free RL methods is temporal difference learning. Example: Q-learning,

$Q(s, a) \leftarrow (1 - \alpha)\, Q(s, a) + \alpha\big[r(s, a) + \gamma \max_{a'} Q(s', a')\big]$

where α is the learning rate and γ is the discount factor. Q-learning computes action-values from a state by exploring new values and exploiting previous knowledge.
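A minimal tabular version of that update for reference; the state names, action set, and reward in the usage line are placeholders, not taken from the talk's domains.

```python
# Tabular Q-learning update: Q(s,a) <- (1-alpha) Q(s,a) + alpha [r + gamma max_a' Q(s',a')].
from collections import defaultdict


def q_learning_update(Q, s, a, r, s_next, actions, alpha=0.1, gamma=0.95):
    best_next = max(Q[(s_next, a2)] for a2 in actions)   # greedy bootstrap target
    Q[(s, a)] = (1 - alpha) * Q[(s, a)] + alpha * (r + gamma * best_next)


Q = defaultdict(float)
q_learning_update(Q, s="s0", a="listen", r=-1.0, s_next="s1",
                  actions=["listen", "open-left", "open-right"])
```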