Reinforcement Learning in Partially Observable Multiagent Settings: Monte Carlo Exploring Policies
1 Reinforcement Learning in Partially Observable Multiagent Settings: Monte Carlo Exploring Policies
Presenter: Roi Ceren, THINC Lab, University of Georgia
Prashant Doshi, THINC Lab, University of Georgia
Bikramjit Banerjee, University of Southern Mississippi
2 Introduction
Model-free reinforcement learning in multiagent systems is a nascent field.
Monte Carlo Exploring Starts for POMDPs (MCES-P) is a powerful single-agent RL technique: policy iteration that leverages Q-learning to hill-climb through the local policy space to a local optimum.
It allows PAC bounds to select the sample complexity with confidence.
3 Introduction
We extend MCES-P to the non-cooperative multiagent setting and introduce MCES for Interactive POMDPs (MCES-IP), which explicitly models the opponent and predicates action-values on expected opponent behavior.
When instantiated with PAC, it trades off the computational expense of modeling against a lower sample complexity bound.
We additionally provide a policy space pruning mechanism to promote scalability: it parametrically bounds the regret incurred by avoiding policies and prioritizes eliminating low-regret policy transformations.
4 Background: Multiagent Decision Process
In the multiagent setting, all agents affect the physical state and the reward for each agent.
(figure: agents i and j each choose an action; the joint action drives the physical state and yields each agent's reward R(s, a_i, a_j))
5 Background: I-POMDP
The Interactive POMDP (I-POMDP) (Gmytrasiewicz and Doshi 2005): <IS, A, T, Ω, O, R>
Non-cooperative: agents get individual, potentially competitive rewards.
Actions A, state transitions T, observations Ω, observation probabilities O, and rewards R.
IS: the interactive state, combining the physical state and a model of the other agent.
Significant uncertainty: the agent must reason not only about the physical state, but also about the opponent's motivations and beliefs.
6 Background: MCES-P Template
Monte Carlo Exploring Starts for POMDPs (MCES-P) (Perkins, AAAI 2002) is a general template (sketched in code below):
- Explore the neighborhood of π: all policies that differ by a single action a on some observation sequence o.
- Compute expected values by simulating policies online.
- Hill climb to policies with better values.
- Terminate if no neighbor is better than the current policy.
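To make the template concrete, here is a minimal Python sketch of this loop; `simulate` and `neighbors` are hypothetical helpers standing in for the paper's online simulation and transformation steps, not the authors' implementation.

```python
def mcesp(policy, simulate, neighbors, k, epsilon):
    """One possible rendering of the MCES-P template.

    policy    -- the current policy pi
    simulate  -- simulate(pi) -> one sampled return of running pi online
    neighbors -- neighbors(pi) -> iterable of single-action transformations
    k         -- samples per policy before values are compared
    epsilon   -- minimum improvement required to transform
    """
    while True:
        # Estimate the current policy's value by Monte Carlo simulation.
        q_pi = sum(simulate(policy) for _ in range(k)) / k
        best_val, best_nbr = q_pi + epsilon, None
        for nbr in neighbors(policy):
            q_nbr = sum(simulate(nbr) for _ in range(k)) / k
            if q_nbr > best_val:
                best_val, best_nbr = q_nbr, nbr
        if best_nbr is None:
            return policy   # terminate: no neighbor beats pi by more than epsilon
        policy = best_nbr   # hill climb to the better neighbor
```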
7 Background: MCES-P Template: Transformation
Pick a random observation sequence and replace its action with a random one.
(figure: policy trees before and after the transformation {o1,o2}: a1 → a3)
8 Background: MCES-P Template: Transformation
Pick a random observation sequence and replace its action with a random one.
(figure: π and its transformed neighbor π′, related by {o1,o2}: a1 ↔ a3)
9 Background: MCES-P Template: Transformation
Pick a random observation sequence and replace its action with a random one.
(figure: π surrounded by its neighbors π′, each reached by a single transformation, e.g. o1: a1 ↔ a2; {o1,o2}: a1 ↔ a2; o1: a1 ↔ a3; {o1,o2}: a1 ↔ a3; ∅: a2 ↔ a1; ∅: a2 ↔ a3)
10 Background: MCES-P Template: Transformation
(figure: the local neighborhood of π; a generation sketch follows)
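One plausible way to generate this local neighborhood, assuming a policy is represented as a dict from observation-sequence tuples to actions (an illustrative representation, not the paper's):

```python
def neighbors(policy, actions):
    """Local neighborhood of pi: every policy that differs from pi by a
    single action on a single observation sequence.

    policy  -- dict mapping observation-sequence tuples to actions
    actions -- list of available actions
    """
    for obs_seq, current in policy.items():
        for a in actions:
            if a != current:
                transformed = dict(policy)  # copy pi
                transformed[obs_seq] = a    # pi(obs_seq): current -> a
                yield transformed

# e.g. the slide's transformation {('o1','o2'): a1 -> a3} is one of these:
pi = {('o1',): 'a2', ('o1', 'o2'): 'a1'}
print(len(list(neighbors(pi, ['a1', 'a2', 'a3']))))  # 2 sequences x 2 alternatives = 4
```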
11 Background: MCES-P Template: Sampling
Pick a random action (e.g. a3) and simulate:
Q_{π,(o,a)} ← (1 − α(m, c_{π,(o,a)})) Q_{π,(o,a)} + α(m, c_{π,(o,a)}) R(τ)
where c_{π,(o,a)} counts samples of this transformation and R(τ) is the return of the simulated trajectory τ.
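Read as a count-weighted running average, the update above might be coded as follows; the schedule α(m, c) = 1/c is an assumption for illustration:

```python
def update_q(Q, counts, key, ret):
    """Q_{pi,(o,a)} <- (1 - alpha) Q + alpha * R(tau), with alpha = 1/count.

    key -- the (observation sequence, action) transformation being sampled
    ret -- R(tau), the return of the simulated trajectory
    """
    counts[key] = counts.get(key, 0) + 1
    alpha = 1.0 / counts[key]   # assumed schedule; any valid alpha(m, c) works
    Q[key] = (1 - alpha) * Q.get(key, 0.0) + alpha * ret
```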
12-14 Background: MCES-P Template: Sampling
Sample the neighborhood k times for each policy; transform when Q_{π′} > Q_π + ε.
(three animation frames showing repeated sampling of π and its neighbors)
15 Background: MCES-P Template: Termination
Terminate when all neighbors have been sampled k times and no neighbor is better.
16 Background: MCESP+PAC
Problem: choosing a good sample bound k.
Low values of k increase the chance we make errors when transforming; high values, though requiring more samples, guarantee we hill-climb correctly.
(figure: low k → high error probability, inaccurate Q-values; high k → low error probability, accurate Q-values)
17 Background: MCESP+PAC
Solution: pick a k that guarantees some confidence in the accuracy of the Q-value.
Probably Approximately Correct (PAC) learning: the probability of the sample average deviating from the true mean by more than ε is bounded by the error δ:
Pr(|X̄ − μ| > ε) ≤ 2 exp(−2k ε² / Λ²) = δ
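Solving this Hoeffding bound for k gives the sample count required for a given ε and δ; a small sketch, where `lam` is Λ, the width of the range of possible returns:

```python
import math

def hoeffding_samples(lam, eps, delta):
    """Smallest k with 2 * exp(-2 * k * eps**2 / lam**2) <= delta."""
    return math.ceil(lam**2 / (2 * eps**2) * math.log(2 / delta))

# e.g. returns spanning a range of width 10, tolerance 0.5, confidence 95%:
print(hoeffding_samples(10.0, 0.5, 0.05))  # 738
```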
18 Background: MCESP+PAC
With ε and δ, we calculate the required number of samples to satisfy the error bound, where m is the number of transformations made so far and N is the number of neighbor policies:
k_m = ⌈(2 Λ(π)² / ε²) ln(2N / δ_m)⌉
δ_m = 6δ / (π² m²)   (a schedule chosen so the per-transformation allowances sum to δ)
Λ(π, π′) = max(Q_π − Q_{π′}) − min(Q_π − Q_{π′}) ≤ 2T (R_max − R_min)
Λ(π) = max_{π′ ∈ Neighborhood(π)} Λ(π, π′)
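A sketch of the resulting per-transformation sample bound, using the δ_m schedule reconstructed above (that schedule is my reading of a garbled slide and should be checked against the paper):

```python
import math

def sample_bound(lam_pi, eps, n_neighbors, m, delta):
    """k_m = ceil( (2 * Lambda(pi)^2 / eps^2) * ln(2N / delta_m) ),
    with delta_m = 6*delta / (pi^2 * m^2) so that the delta_m sum to delta."""
    delta_m = 6 * delta / (math.pi**2 * m**2)
    return math.ceil(2 * lam_pi**2 / eps**2
                     * math.log(2 * n_neighbors / delta_m))
```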
19 Background: MCESP+PAC
We can transform early by modifying ε:
ε̄(m, p, q) = ε/2   if p = q = k_m
ε̄(m, p, q) = Λ(π, π′) √( (1/2p) ln( 2(k_m − 1)N / δ_m ) )   if p = q < k_m
Terminate when k_m samples of each neighbor have been taken, or when for all neighbor policies:
Q_{π,(o,a)} < Q_{π,π(o)} + ε − ε̄(m, c_{π,(o,a)}, c_{π,π(o)})
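A hedged rendering of this early-transformation threshold; the use of min(p, q) when the counts differ is my assumption, since the slide only spells out the p = q cases:

```python
import math

def eps_bar(lam, eps, k_m, n_neighbors, delta_m, p, q):
    """Error allowance after p samples of the neighbor and q of pi.

    Returns eps/2 once both counts reach k_m; otherwise a wider
    Hoeffding-style bound that lets us transform (or terminate) early.
    """
    if p == q == k_m:
        return eps / 2
    c = min(p, q)  # assumed: use the smaller count when p != q
    return lam * math.sqrt(
        math.log(2 * (k_m - 1) * n_neighbors / delta_m) / (2 * c))
```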
20 Background: MCESP+PAC
Then, with probability 1 − δ:
1. MCESP+PAC picks transformations that are always better than the current policy.
2. MCESP+PAC terminates with a policy that is an ε-local optimum; that is, no neighbor is better than the final policy by more than ε.
21 MCES-P for Multiagent Settings
MCES-P can almost be used as-is in the multiagent setting.
Observations: public (noisily indicate the physical state) and private (noisily indicate the other agents' actions).
MCES-P has high computational costs: a large neighborhood requiring k_m samples each.
MCES for I-POMDPs explicitly models the opponent and significantly decreases sample requirements.
22 MCES-IP Template: MCES-P vs. MCES-IP
MCES-P simulation and Q-update:
- Pick random o and a.
- Simulate the transformed policy π(o → a), generating trajectory τ.
- Update Q_{π,(o,a)} with R(τ).
MCES-IP reasons about which actions the opponent took in the simulation prior to updating:
- Pick random o and a.
- Simulate π(o → a), generating τ.
- Update the belief over opponent models.
- Calculate a_j from the most likely models.
- Update Q^{a_j}_{π,(o,a)} with R(τ).
23 MCES-IP Template: Models
MCES-IP maintains a set of models of the opponent, where a model = <history, policy tree>.
(figure: three candidate opponent models m1, m2, m3, each a policy tree)
24-27 MCES-IP Template: Generating a_j
Every round, MCES-IP updates its belief over the models, identifies the most probable model, and selects that model's most probable action.
(animation over slides 24-27: the belief over m1, m2, m3 is updated from the private observations at t = 1, 2, 3, yielding the inferred opponent actions a_j⁰ = 2, a_j¹ = 1, a_j² = 3, i.e. a_j = {2, 1, 3}; a belief-update sketch follows)
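A minimal sketch of this model-update step, assuming a `likelihood(model, obs)` interface for how well each candidate model explains agent i's private observation (hypothetical naming throughout):

```python
def update_belief(belief, models, likelihood, obs):
    """Bayes-update the belief over candidate opponent models.

    belief     -- dict: model name -> probability
    likelihood -- likelihood(model, obs) -> Pr(obs | model); assumed interface
    obs        -- agent i's private observation this round
    """
    posterior = {m: belief[m] * likelihood(m, obs) for m in models}
    z = sum(posterior.values())
    return {m: p / z for m, p in posterior.items()}

def most_probable_action(belief, action_of):
    """a_j from the most probable model; action_of(model) is assumed to read
    the model's next action off its policy tree and history."""
    best = max(belief, key=belief.get)
    return action_of(best)
```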
28 MCES-IP Template: Updating Q-values
Update counts and Q-values using a_j:
Q^{a_j}_{π,(o,a)} ← (1 − α(m, c^{a_j}_{π,(o,a)})) Q^{a_j}_{π,(o,a)} + α(m, c^{a_j}_{π,(o,a)}) R(τ)
So far, MCES-IP is more expensive than MCES-P: the Q-table is now up to |A_j| times larger!
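The MCES-IP update then differs from the MCES-P sketch above only in the table key, which additionally carries the inferred opponent actions a_j:

```python
def update_q_ip(Q, counts, a_j, key, ret):
    """Q^{a_j}_{pi,(o,a)} <- (1 - alpha) Q + alpha * R(tau).

    a_j -- tuple of inferred opponent actions, e.g. (2, 1, 3)
    key -- the (observation sequence, action) transformation sampled
    """
    k = (a_j, key)              # table grows with the number of distinct a_j
    counts[k] = counts.get(k, 0) + 1
    alpha = 1.0 / counts[k]     # assumed schedule, as in the MCES-P sketch
    Q[k] = (1 - alpha) * Q.get(k, 0.0) + alpha * ret
```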
29 MCESIP+PAC: PAC Bounds
MCESIP+PAC has similar PAC bounds to MCESP+PAC:
k_m = ⌈(2 Λ^{a_j}(π)² / ε²) ln(2N / δ_m)⌉
ε̄^{a_j}(m, p, q) = ε/2   if p = q = k_m
ε̄^{a_j}(m, p, q) = Λ^{a_j}(π, π′) √( (1/2p) ln( 2(k_m − 1)N / δ_m ) )   if p = q < k_m
30 MCESIP+PAC: PAC Bounds
Λ^{a_j} modifies the range of possible rewards: since the opponent's actions are known, the range of possible rewards may often be narrower, resulting in the following proposition:
Λ^{a_j}(π, π′) ≤ Λ(π, π′)
31 MCESIP+PAC: PAC Bounds
MCESIP+PAC terminates when k_m samples of the local neighborhood yield no better policy, or when for all neighbors π′:
Q_{π′} < Q_π + ε − ε̄(m, c_{π′}, c_π)
With probability 1 − δ:
1. MCESIP+PAC picks transformations that are always better than the current policy.
2. MCESIP+PAC terminates with a policy that is an ε-local optimum.
32 Policy Search Space Pruning
33 Policy Search Space Pruning: Introduction
Not all observation sequences occur with the same probability, and low-likelihood events are difficult to sample.
Pruning: avoid policy transformations that involve rare observation sequences, while considering the impact on reward.
Regret: the amount of expected value lost by not simulating these transformations.
34 Policy Search Space Pruning: Regret
(figure: transformations annotated with the probability of their observation sequence and the resulting regret, e.g. Pr ≈ 6% → regret 6.6 vs. Pr ≈ 30% → regret 33; a pruning sketch follows)
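The figure's numbers suggest that a transformation's regret scales with the probability of its observation sequence times the value at stake; a sketch of the pruning rule under that reading (the paper's exact regret formula may differ):

```python
def prune(transformations, allowable_regret):
    """Keep only transformations worth sampling.

    transformations  -- list of (transformation, regret) pairs, where regret
                        estimates the expected value lost by never trying it
                        (low for rare observation sequences)
    allowable_regret -- parametric bound: prune anything at or below it
    """
    return [t for t, r in transformations if r > allowable_regret]

# Example mirroring the figure: the rare sequence (Pr ~ 6%, regret 6.6) is
# pruned before the common one (Pr ~ 30%, regret 33).
cands = [('rare-seq transform', 6.6), ('common-seq transform', 33.0)]
print(prune(cands, 10.0))  # ['common-seq transform']
```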
35-38 Policy Search Space Pruning: Allowed Transformations
(animation over slides 35-38: an allowable-regret slider from 100% down to 0%; as the allowed regret decreases, fewer transformations remain in the search space)
39 Experiments: Domains
3 domains: the Multiagent Tiger Problem and the 3x2 UAV Problem (the Money Laundering Problem follows).
40-41 Experiments: Domains
Money Laundering (ML) Problem.
Stages: Placement, Layering, Integration; states include bank, offshore, casinos, insurance, shell companies, real estate.
42 Experiments: Domain Parameters
The opponent follows a fixed strategy.
Single: only one policy is ever used.
Mixed (non-stationary environment): randomly selects from 2 to 3 policies for each new trajectory.
(table: ε, δ, % regret, and horizon per domain; the numeric ε, δ, and % regret values did not survive transcription; horizon = 3 for Multiagent Tiger, 3x2 UAV, and Money Laundering)
43 Experiments: Comparative Results
(figures, right: two runs comparing MCESP+PAC and MCESIP+PAC; right-top: mixed-strategy opponent; right-middle: single-strategy opponent)
44 Experiments: Pruning
Pruning is crucial to tractability.
45 Concluding Remarks
Model-free RL in multiagent settings, generalized from MCES-P.
MCES-IP models the opponent and is more sample-efficient when paired with PAC bounds; it is partially model-free.
Instantiated with PAC to provide ε-local optimality, and with search space pruning for improved scalability.
46 Thank you! Q & A
47 Related Works
Bayes-Adaptive POMDPs (Ross et al. 2007), extended to MPOMDPs (Amato and Oliehoek 2013): model-based RL.
IMCQ-Alt for Dec-POMDPs (Banerjee et al. 2013): quasi-model-based (intermediate calculation of model parameters); alternating, i.e. each agent must take turns.
Bayes-Adaptive I-POMDPs (Ng et al. 2012): model-based RL; the physical state is perfectly observable.
48 Background: Decision Processes
Decision problem: how to optimize behavior to maximize reward? Choose the action that has the best expected outcome.
(figure: an agent with preferences chooses an action and receives reward R(a))
49-50 Background: Decision Processes
(figure: the physical state is added; the agent's action and the state determine the reward R(s, a))
51 Background: RL
A popular class of model-free RL methods is temporal difference learning.
Example: Q-learning:
Q(s, a) ← (1 − α) Q(s, a) + α [ r(s, a) + γ max_{a′} Q(s′, a′) ]
α: learning rate; γ: discount factor.
Q-learning computes action-values from a state by exploring new values and exploiting previous knowledge.
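For completeness, the Q-learning backup from this slide as a tabular update; a minimal sketch:

```python
def q_learning_update(Q, s, a, r, s_next, actions, alpha, gamma):
    """Q(s,a) <- (1 - alpha) Q(s,a) + alpha * (r + gamma * max_a' Q(s',a'))."""
    target = r + gamma * max(Q.get((s_next, a2), 0.0) for a2 in actions)
    Q[(s, a)] = (1 - alpha) * Q.get((s, a), 0.0) + alpha * target
```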