POMDP S : Exact and Approximate Solutions

Size: px
Start display at page:

Download "POMDP S : Exact and Approximate Solutions"

Transcription

1 POMDP S : Exact and Approximate Solutions Bharaneedharan R University of Illinois at Chicago Automated optimal decision making. Fall 2002 p.1/21

2 Road-map Definition of POMDP Belief as a sufficient statistic General solutions to a POMDP Monahan s enumeration algorithm Incremental pruning Approximations MDP based approximations Grid based approximation Automated optimal decision making. Fall 2002 p.2/21

3 Definition of POMDP A POMDP is defined by the tuple <S, A, T, θ, O, R > S is a finite set of world states. A is a finite set of actions that can be performed by the agent. T : S A S [0,1], the likelihood of an action changing the system state from one to another. θ is the finite set of observations that the agent can make (sensory inputs available for the agent). O : S θ A [0,1], the likelihood of making a certain observation at a state after performing a particular action. R : S A S R, defines payoff s for the agent in a given system state after performing an action. Automated optimal decision making. Fall 2002 p.3/21

4 Belief s and Information States Let the information state process(isp) be I 0, I 1,... I t I t = [o t, a t 1, I t 1 ] Belief or the probability distribution over states at any time t is given as Be j (t) = P (S(t) = j I t 1 ) I 0 is the information state before agents start performing actions. If observations result only after performing actions I 0 = Be(0), the prior at time t=0 Redefining ISP I 0 = Be(0), I 1 = [o 1, a 0, I 0 ],... Automated optimal decision making. Fall 2002 p.4/21

5 Belief is a sufficient statistic Be j (t) = P (S(t) = j o t, a t 1, I t 1 ) At t=0, Be(0) is just a prior At t=1, Be j (1) = P (S(t) = j o 1, a 0, Be j (0)) We see that at any time t, Be(t) will be dependent on current observation, previous action and previous belief state. Also current observation and state is dependent on only previous action and information state Consequently Be(t) is a compact representation of I t Automated optimal decision making. Fall 2002 p.5/21

6 General solution to a POMDP Let b i (S) for all i V t (b i ) = Max a A R(b i a) + γ o θ P (o b i, a) V t 1 (τ(b i, o, a)) α t (s i, a) = R(s i, a) + γ o θ s j S P (o s j, a) P (s j s i, a) α t 1 α t 1 is an alpha vector from the previous epoch for a given observation o. o1 o1 α1(t) o1 o2 α1(t-1) α2(t-1) o2 o3 α1(t) o1 o2 α2(t-1) α1(t-1) o2 o3 o3 o3 α3(t-1) α2(t-1) Automated optimal decision making. Fall 2002 p.6/21

7 Continued... Each choice of α t 1 for a given o is a choice of a future plan given o. α t is constructed using the best of such choices of plans for all possible observations. α1(t) o1 o2 o3 α1(t-1) α1(t-1) α1(t-1) α1(t-1) α1(t-1) α2(t-1) α1(t-1) α1(t-1) α3(t-1) : : α3(t-1) α1(t-1) α2(t-1) : : α3(t-1) α3(t-1) α3(t-1) Automated optimal decision making. Fall 2002 p.7/21

8 Piecewise linearity and consequences We know that the value function is piecewise linear i.e. V t (b i ) = Max α k t α k t b i Computing the value function involves just projecting the alpha vectors. But alpha vectors are linear over the belief space and hence only the values simplex corners represented in alpha vectors need to be computed. Automated optimal decision making. Fall 2002 p.8/21

9 Enumeration algorithm [Monahan 82] At a given epoch, project all the alpha vectors describing the value function at previous epoch. If we have Γ t is the set of alpha vectors at epoch t, Γ t = A Γ t 1 Θ Find the useful set of alpha vectors that forms the value function Useless alpha vectors Automated optimal decision making. Fall 2002 p.9/21

10 Testing for redundancy Test each alpha vector if it maximizes the value function at least at one point in the belief space Let π i π = Be(s = i) and α k be the vector to be tested α k is an useful vector iff for all vectors j k, i π iα j i π i α k is satisfied at some π i.e. i π i(α j α k ) 0 To find the point where the vector maximizes the most we rewrite the above as an LP maximize : δ i π i(α j α k ) + δ 0 for each alpha vector j k i π i = 1 π i 0 This leads to an LP with ( Γ 1) + 1+ S constraints and S +1 variables Automated optimal decision making. Fall 2002 p.10/21

11 Problems with simple enumeration Brute force generate and test For a problem with 30 states, 4 actions and 8 observations, we would need about 60MB of memory to store the new vectors in epoch 2. With 12 observations thats a whopping 15GB. Useful only for solving very small problems Automated optimal decision making. Fall 2002 p.11/21

12 Incremental pruning [zhang, Liu 96] Extension to the enumeration algorithm, using interleaved generation and redundancy testing. The key is in breaking up the dynamic programming update of the value function. V (b) = max a A V a (b) V a (b) = V a o (b) = o θ V a o (b) s R(a,s)b(s) θ + γp (o b, a)v (b a o) The maximizing set of alpha vectors is determined from bottom to top. Leading to iterative purging of smaller set of alpha vectors Automated optimal decision making. Fall 2002 p.12/21

13 Partial projection and pruning Let W, W a, W a o be the set of vectors describing V, V a, V a o W = purge a A W a W a = purge o θ W a o W a o = purge ({τ(α, a, o) α W }) W = projection(w ) τ(α, a, o)(s) = R(a,s) θ + γ s α(s )P (o s, a)p (s s, a) purge(.) returns the maximizing set of vectors in the belief space ( S dimensional space) Automated optimal decision making. Fall 2002 p.13/21

14 Example ,* ,1 0,1 0,0 0, ,* , ,0 12 1,* 5.5 1, ,1 11 0,* 0,* Automated optimal decision making. Fall 2002 p.14/21

15 Approximations MDP based approximation Grid based approximation Automated optimal decision making. Fall 2002 p.15/21

16 MDP based approximation Is a crude approximation assuming complete observability Solve the underlying MDP and computer value function V MDP Value of a belief state is computed as V (b) = s b(s)v MDP (s) We can also redefine our update rule as V i+1 (b) = s b(s )max a A [R(s, a) + γ s P (s s, a)vi MDP (s)] Note V i will always be described by a single vector The MDP based update above provides an upper bound to the general update using observations. MDP-V(s2) MDP-V(s1) V(b) 0 1 Automated optimal decision making. Fall 2002 p.16/21

17 Upper bound property Let H be our regular POMDP update function involving observations. H MDP is the MDP based update function. We have to show that HV i H MDP V i HV i (b) = max a A s S R(s, a)b(s) + γ o θ s S s S P (s, o s, a)b(s)α MDP i (s ) = max a A s S b(s)[r(s, a) + γ s S P (s s, a)v MDP i (s )] s S b(s)max a A [R(s, a) + γ s S P (s s, a)v MDP i (s )] = H MDP V i (b) Automated optimal decision making. Fall 2002 p.17/21

18 Grid based approximation [Lovejoy 91] General value iteration in all exact algorithms compute the value function over the complete belief space. Compute the value function at finite number of points in the belief space. Compute the value of other belief states with values of our chosen set of belief states using interpolation. Automated optimal decision making. Fall 2002 p.18/21

19 Grid based value function updates Let G be the set of grid points V (b) = E(b, V G ) = G j=1 λ jv G (b j ) 0 λ j 1 G j=1 λ j = 1 V i+1 (b G j ) = max a AR(b G j, a) + γ o θ P (o τ(bg j, a), a)v i(τ(b G j, a, o)) Let us call this update function based on grid points as H G Automated optimal decision making. Fall 2002 p.19/21

20 Lower bounds The grid based update function H G provides a lower bound to the regular update function H. H G V i HV i Proof: Let V i be a set of vectors describing the value function H G will always use only a partial set of vectors from V i for updating the value of the grid points. Since H G only considers maximal vectors at grid points it ignores the maximal vectors at other belief points. In the best case scenario H G V i = HV i Automated optimal decision making. Fall 2002 p.20/21

21 Upper bounds The value function based on the grid points and point interpolation together provides us with an upper bound on the value function. HV (b) = HV G λ i b G i i=1 G i=1 [HV (b G i )] = H G V (b) Automated optimal decision making. Fall 2002 p.21/21

Partially Observable Markov Decision Processes (POMDPs)

Partially Observable Markov Decision Processes (POMDPs) Partially Observable Markov Decision Processes (POMDPs) Geoff Hollinger Sequential Decision Making in Robotics Spring, 2011 *Some media from Reid Simmons, Trey Smith, Tony Cassandra, Michael Littman, and

More information

Markov decision processes (MDP) CS 416 Artificial Intelligence. Iterative solution of Bellman equations. Building an optimal policy.

Markov decision processes (MDP) CS 416 Artificial Intelligence. Iterative solution of Bellman equations. Building an optimal policy. Page 1 Markov decision processes (MDP) CS 416 Artificial Intelligence Lecture 21 Making Complex Decisions Chapter 17 Initial State S 0 Transition Model T (s, a, s ) How does Markov apply here? Uncertainty

More information

Efficient Maximization in Solving POMDPs

Efficient Maximization in Solving POMDPs Efficient Maximization in Solving POMDPs Zhengzhu Feng Computer Science Department University of Massachusetts Amherst, MA 01003 fengzz@cs.umass.edu Shlomo Zilberstein Computer Science Department University

More information

Region-Based Dynamic Programming for Partially Observable Markov Decision Processes

Region-Based Dynamic Programming for Partially Observable Markov Decision Processes Region-Based Dynamic Programming for Partially Observable Markov Decision Processes Zhengzhu Feng Department of Computer Science University of Massachusetts Amherst, MA 01003 fengzz@cs.umass.edu Abstract

More information

RL 14: Simplifications of POMDPs

RL 14: Simplifications of POMDPs RL 14: Simplifications of POMDPs Michael Herrmann University of Edinburgh, School of Informatics 04/03/2016 POMDPs: Points to remember Belief states are probability distributions over states Even if computationally

More information

Introduction to Reinforcement Learning

Introduction to Reinforcement Learning CSCI-699: Advanced Topics in Deep Learning 01/16/2019 Nitin Kamra Spring 2019 Introduction to Reinforcement Learning 1 What is Reinforcement Learning? So far we have seen unsupervised and supervised learning.

More information

PROBABILISTIC PLANNING WITH RISK-SENSITIVE CRITERION PING HOU. A dissertation submitted to the Graduate School

PROBABILISTIC PLANNING WITH RISK-SENSITIVE CRITERION PING HOU. A dissertation submitted to the Graduate School PROBABILISTIC PLANNING WITH RISK-SENSITIVE CRITERION BY PING HOU A dissertation submitted to the Graduate School in partial fulfillment of the requirements for the degree Doctor of Philosophy Major Subject:

More information

Partially observable Markov decision processes. Department of Computer Science, Czech Technical University in Prague

Partially observable Markov decision processes. Department of Computer Science, Czech Technical University in Prague Partially observable Markov decision processes Jiří Kléma Department of Computer Science, Czech Technical University in Prague https://cw.fel.cvut.cz/wiki/courses/b4b36zui/prednasky pagenda Previous lecture:

More information

Planning with Predictive State Representations

Planning with Predictive State Representations Planning with Predictive State Representations Michael R. James University of Michigan mrjames@umich.edu Satinder Singh University of Michigan baveja@umich.edu Michael L. Littman Rutgers University mlittman@cs.rutgers.edu

More information

Artificial Intelligence

Artificial Intelligence Artificial Intelligence Dynamic Programming Marc Toussaint University of Stuttgart Winter 2018/19 Motivation: So far we focussed on tree search-like solvers for decision problems. There is a second important

More information

Planning and Acting in Partially Observable Stochastic Domains

Planning and Acting in Partially Observable Stochastic Domains Planning and Acting in Partially Observable Stochastic Domains Leslie Pack Kaelbling*, Michael L. Littman**, Anthony R. Cassandra*** *Computer Science Department, Brown University, Providence, RI, USA

More information

Planning in Markov Decision Processes

Planning in Markov Decision Processes Carnegie Mellon School of Computer Science Deep Reinforcement Learning and Control Planning in Markov Decision Processes Lecture 3, CMU 10703 Katerina Fragkiadaki Markov Decision Process (MDP) A Markov

More information

Optimally Solving Dec-POMDPs as Continuous-State MDPs

Optimally Solving Dec-POMDPs as Continuous-State MDPs Optimally Solving Dec-POMDPs as Continuous-State MDPs Jilles Dibangoye (1), Chris Amato (2), Olivier Buffet (1) and François Charpillet (1) (1) Inria, Université de Lorraine France (2) MIT, CSAIL USA IJCAI

More information

Towards Uncertainty-Aware Path Planning On Road Networks Using Augmented-MDPs. Lorenzo Nardi and Cyrill Stachniss

Towards Uncertainty-Aware Path Planning On Road Networks Using Augmented-MDPs. Lorenzo Nardi and Cyrill Stachniss Towards Uncertainty-Aware Path Planning On Road Networks Using Augmented-MDPs Lorenzo Nardi and Cyrill Stachniss Navigation under uncertainty C B C B A A 2 `B` is the most likely position C B C B A A 3

More information

Symbolic Perseus: a Generic POMDP Algorithm with Application to Dynamic Pricing with Demand Learning

Symbolic Perseus: a Generic POMDP Algorithm with Application to Dynamic Pricing with Demand Learning Symbolic Perseus: a Generic POMDP Algorithm with Application to Dynamic Pricing with Demand Learning Pascal Poupart (University of Waterloo) INFORMS 2009 1 Outline Dynamic Pricing as a POMDP Symbolic Perseus

More information

Decision Theory: Q-Learning

Decision Theory: Q-Learning Decision Theory: Q-Learning CPSC 322 Decision Theory 5 Textbook 12.5 Decision Theory: Q-Learning CPSC 322 Decision Theory 5, Slide 1 Lecture Overview 1 Recap 2 Asynchronous Value Iteration 3 Q-Learning

More information

Internet Monetization

Internet Monetization Internet Monetization March May, 2013 Discrete time Finite A decision process (MDP) is reward process with decisions. It models an environment in which all states are and time is divided into stages. Definition

More information

Reinforcement Learning. Introduction

Reinforcement Learning. Introduction Reinforcement Learning Introduction Reinforcement Learning Agent interacts and learns from a stochastic environment Science of sequential decision making Many faces of reinforcement learning Optimal control

More information

European Workshop on Reinforcement Learning A POMDP Tutorial. Joelle Pineau. McGill University

European Workshop on Reinforcement Learning A POMDP Tutorial. Joelle Pineau. McGill University European Workshop on Reinforcement Learning 2013 A POMDP Tutorial Joelle Pineau McGill University (With many slides & pictures from Mauricio Araya-Lopez and others.) August 2013 Sequential decision-making

More information

Decision Procedures An Algorithmic Point of View

Decision Procedures An Algorithmic Point of View An Algorithmic Point of View ILP References: Integer Programming / Laurence Wolsey Deciding ILPs with Branch & Bound Intro. To mathematical programming / Hillier, Lieberman Daniel Kroening and Ofer Strichman

More information

Decision Theory: Markov Decision Processes

Decision Theory: Markov Decision Processes Decision Theory: Markov Decision Processes CPSC 322 Lecture 33 March 31, 2006 Textbook 12.5 Decision Theory: Markov Decision Processes CPSC 322 Lecture 33, Slide 1 Lecture Overview Recap Rewards and Policies

More information

, and rewards and transition matrices as shown below:

, and rewards and transition matrices as shown below: CSE 50a. Assignment 7 Out: Tue Nov Due: Thu Dec Reading: Sutton & Barto, Chapters -. 7. Policy improvement Consider the Markov decision process (MDP) with two states s {0, }, two actions a {0, }, discount

More information

Section Handout 5 Solutions

Section Handout 5 Solutions CS 188 Spring 2019 Section Handout 5 Solutions Approximate Q-Learning With feature vectors, we can treat values of states and q-states as linear value functions: V (s) = w 1 f 1 (s) + w 2 f 2 (s) +...

More information

RL 14: POMDPs continued

RL 14: POMDPs continued RL 14: POMDPs continued Michael Herrmann University of Edinburgh, School of Informatics 06/03/2015 POMDPs: Points to remember Belief states are probability distributions over states Even if computationally

More information

Module 8 Linear Programming. CS 886 Sequential Decision Making and Reinforcement Learning University of Waterloo

Module 8 Linear Programming. CS 886 Sequential Decision Making and Reinforcement Learning University of Waterloo Module 8 Linear Programming CS 886 Sequential Decision Making and Reinforcement Learning University of Waterloo Policy Optimization Value and policy iteration Iterative algorithms that implicitly solve

More information

Solving Risk-Sensitive POMDPs with and without Cost Observations

Solving Risk-Sensitive POMDPs with and without Cost Observations Solving Risk-Sensitive POMDPs with and without Cost Observations Ping Hou Department of Computer Science New Mexico State University Las Cruces, NM 88003, USA phou@cs.nmsu.edu William Yeoh Department of

More information

Heuristic Search Value Iteration for POMDPs

Heuristic Search Value Iteration for POMDPs 520 SMITH & SIMMONS UAI 2004 Heuristic Search Value Iteration for POMDPs Trey Smith and Reid Simmons Robotics Institute, Carnegie Mellon University {trey,reids}@ri.cmu.edu Abstract We present a novel POMDP

More information

POMDP solution methods

POMDP solution methods POMDP solution methods Darius Braziunas Department of Computer Science University of Toronto 2003 Abstract This is an overview of partially observable Markov decision processes (POMDPs). We describe POMDP

More information

GPU Accelerated Markov Decision Processes in Crowd Simulation

GPU Accelerated Markov Decision Processes in Crowd Simulation GPU Accelerated Markov Decision Processes in Crowd Simulation Sergio Ruiz Computer Science Department Tecnológico de Monterrey, CCM Mexico City, México sergio.ruiz.loza@itesm.mx Benjamín Hernández National

More information

Factored State Spaces 3/2/178

Factored State Spaces 3/2/178 Factored State Spaces 3/2/178 Converting POMDPs to MDPs In a POMDP: Action + observation updates beliefs Value is a function of beliefs. Instead we can view this as an MDP where: There is a state for every

More information

6 Reinforcement Learning

6 Reinforcement Learning 6 Reinforcement Learning As discussed above, a basic form of supervised learning is function approximation, relating input vectors to output vectors, or, more generally, finding density functions p(y,

More information

Exploiting Symmetries for Single and Multi-Agent Partially Observable Stochastic Domains

Exploiting Symmetries for Single and Multi-Agent Partially Observable Stochastic Domains Exploiting Symmetries for Single and Multi-Agent Partially Observable Stochastic Domains Byung Kon Kang, Kee-Eung Kim KAIST, 373-1 Guseong-dong Yuseong-gu Daejeon 305-701, Korea Abstract While Partially

More information

Accelerated Vector Pruning for Optimal POMDP Solvers

Accelerated Vector Pruning for Optimal POMDP Solvers Accelerated Vector Pruning for Optimal POMDP Solvers Erwin Walraven and Matthijs T. J. Spaan Delft University of Technology Mekelweg 4, 2628 CD Delft, The Netherlands Abstract Partially Observable Markov

More information

MARKOV DECISION PROCESSES (MDP) AND REINFORCEMENT LEARNING (RL) Versione originale delle slide fornita dal Prof. Francesco Lo Presti

MARKOV DECISION PROCESSES (MDP) AND REINFORCEMENT LEARNING (RL) Versione originale delle slide fornita dal Prof. Francesco Lo Presti 1 MARKOV DECISION PROCESSES (MDP) AND REINFORCEMENT LEARNING (RL) Versione originale delle slide fornita dal Prof. Francesco Lo Presti Historical background 2 Original motivation: animal learning Early

More information

Chapter 16 Planning Based on Markov Decision Processes

Chapter 16 Planning Based on Markov Decision Processes Lecture slides for Automated Planning: Theory and Practice Chapter 16 Planning Based on Markov Decision Processes Dana S. Nau University of Maryland 12:48 PM February 29, 2012 1 Motivation c a b Until

More information

POMDPs and Policy Gradients

POMDPs and Policy Gradients POMDPs and Policy Gradients MLSS 2006, Canberra Douglas Aberdeen Canberra Node, RSISE Building Australian National University 15th February 2006 Outline 1 Introduction What is Reinforcement Learning? Types

More information

Finite-State Controllers Based on Mealy Machines for Centralized and Decentralized POMDPs

Finite-State Controllers Based on Mealy Machines for Centralized and Decentralized POMDPs Finite-State Controllers Based on Mealy Machines for Centralized and Decentralized POMDPs Christopher Amato Department of Computer Science University of Massachusetts Amherst, MA 01003 USA camato@cs.umass.edu

More information

Markov decision processes

Markov decision processes CS 2740 Knowledge representation Lecture 24 Markov decision processes Milos Hauskrecht milos@cs.pitt.edu 5329 Sennott Square Administrative announcements Final exam: Monday, December 8, 2008 In-class Only

More information

Linear programming: algebra

Linear programming: algebra : algebra CE 377K March 26, 2015 ANNOUNCEMENTS Groups and project topics due soon Announcements Groups and project topics due soon Did everyone get my test email? Announcements REVIEW geometry Review geometry

More information

Inverse Reinforcement Learning in Partially Observable Environments

Inverse Reinforcement Learning in Partially Observable Environments Proceedings of the Twenty-First International Joint Conference on Artificial Intelligence (IJCAI-09) Inverse Reinforcement Learning in Partially Observable Environments Jaedeug Choi and Kee-Eung Kim Department

More information

Point Based Value Iteration with Optimal Belief Compression for Dec-POMDPs

Point Based Value Iteration with Optimal Belief Compression for Dec-POMDPs Point Based Value Iteration with Optimal Belief Compression for Dec-POMDPs Liam MacDermed College of Computing Georgia Institute of Technology Atlanta, GA 30332 liam@cc.gatech.edu Charles L. Isbell College

More information

State Space Compression with Predictive Representations

State Space Compression with Predictive Representations State Space Compression with Predictive Representations Abdeslam Boularias Laval University Quebec GK 7P4, Canada Masoumeh Izadi McGill University Montreal H3A A3, Canada Brahim Chaib-draa Laval University

More information

Bayes-Adaptive POMDPs 1

Bayes-Adaptive POMDPs 1 Bayes-Adaptive POMDPs 1 Stéphane Ross, Brahim Chaib-draa and Joelle Pineau SOCS-TR-007.6 School of Computer Science McGill University Montreal, Qc, Canada Department of Computer Science and Software Engineering

More information

Exponential Moving Average Based Multiagent Reinforcement Learning Algorithms

Exponential Moving Average Based Multiagent Reinforcement Learning Algorithms Exponential Moving Average Based Multiagent Reinforcement Learning Algorithms Mostafa D. Awheda Department of Systems and Computer Engineering Carleton University Ottawa, Canada KS 5B6 Email: mawheda@sce.carleton.ca

More information

Reinforcement Learning as Variational Inference: Two Recent Approaches

Reinforcement Learning as Variational Inference: Two Recent Approaches Reinforcement Learning as Variational Inference: Two Recent Approaches Rohith Kuditipudi Duke University 11 August 2017 Outline 1 Background 2 Stein Variational Policy Gradient 3 Soft Q-Learning 4 Closing

More information

Partially Observable Markov Decision Processes: A Geometric Technique and Analysis

Partially Observable Markov Decision Processes: A Geometric Technique and Analysis OPERATIONS RESEARCH Vol. 58, No., January February 200, pp. 24 228 issn 0030-364X eissn 526-5463 0 580 024 informs doi 0.287/opre.090.0697 200 INFORMS Partially Observable Markov Decision Processes: A

More information

A POMDP Framework for Cognitive MAC Based on Primary Feedback Exploitation

A POMDP Framework for Cognitive MAC Based on Primary Feedback Exploitation A POMDP Framework for Cognitive MAC Based on Primary Feedback Exploitation Karim G. Seddik and Amr A. El-Sherif 2 Electronics and Communications Engineering Department, American University in Cairo, New

More information

Bootstrapping LPs in Value Iteration for Multi-Objective and Partially Observable MDPs

Bootstrapping LPs in Value Iteration for Multi-Objective and Partially Observable MDPs Bootstrapping LPs in Value Iteration for Multi-Objective and Partially Observable MDPs Diederik M. Roijers Vrije Universiteit Brussel & Vrije Universiteit Amsterdam Erwin Walraven Delft University of Technology

More information

Today s Outline. Recap: MDPs. Bellman Equations. Q-Value Iteration. Bellman Backup 5/7/2012. CSE 473: Artificial Intelligence Reinforcement Learning

Today s Outline. Recap: MDPs. Bellman Equations. Q-Value Iteration. Bellman Backup 5/7/2012. CSE 473: Artificial Intelligence Reinforcement Learning CSE 473: Artificial Intelligence Reinforcement Learning Dan Weld Today s Outline Reinforcement Learning Q-value iteration Q-learning Exploration / exploitation Linear function approximation Many slides

More information

Computing Optimal Policies for Partially Observable Decision Processes using Compact Representations

Computing Optimal Policies for Partially Observable Decision Processes using Compact Representations Computing Optimal Policies for Partially Observable Decision Processes using Compact epresentations Craig Boutilier and David Poole Department of Computer Science niversity of British Columbia Vancouver,

More information

A Nonlinear Predictive State Representation

A Nonlinear Predictive State Representation Draft: Please do not distribute A Nonlinear Predictive State Representation Matthew R. Rudary and Satinder Singh Computer Science and Engineering University of Michigan Ann Arbor, MI 48109 {mrudary,baveja}@umich.edu

More information

Notes taken by Graham Taylor. January 22, 2005

Notes taken by Graham Taylor. January 22, 2005 CSC4 - Linear Programming and Combinatorial Optimization Lecture : Different forms of LP. The algebraic objects behind LP. Basic Feasible Solutions Notes taken by Graham Taylor January, 5 Summary: We first

More information

Course on Automated Planning: Transformations

Course on Automated Planning: Transformations Course on Automated Planning: Transformations Hector Geffner ICREA & Universitat Pompeu Fabra Barcelona, Spain H. Geffner, Course on Automated Planning, Rome, 7/2010 1 AI Planning: Status The good news:

More information

15-780: Graduate Artificial Intelligence. Reinforcement learning (RL)

15-780: Graduate Artificial Intelligence. Reinforcement learning (RL) 15-780: Graduate Artificial Intelligence Reinforcement learning (RL) From MDPs to RL We still use the same Markov model with rewards and actions But there are a few differences: 1. We do not assume we

More information

MS&E338 Reinforcement Learning Lecture 1 - April 2, Introduction

MS&E338 Reinforcement Learning Lecture 1 - April 2, Introduction MS&E338 Reinforcement Learning Lecture 1 - April 2, 2018 Introduction Lecturer: Ben Van Roy Scribe: Gabriel Maher 1 Reinforcement Learning Introduction In reinforcement learning (RL) we consider an agent

More information

Stochastic Primal-Dual Methods for Reinforcement Learning

Stochastic Primal-Dual Methods for Reinforcement Learning Stochastic Primal-Dual Methods for Reinforcement Learning Alireza Askarian 1 Amber Srivastava 1 1 Department of Mechanical Engineering University of Illinois at Urbana Champaign Big Data Optimization,

More information

Kalman Based Temporal Difference Neural Network for Policy Generation under Uncertainty (KBTDNN)

Kalman Based Temporal Difference Neural Network for Policy Generation under Uncertainty (KBTDNN) Kalman Based Temporal Difference Neural Network for Policy Generation under Uncertainty (KBTDNN) Alp Sardag and H.Levent Akin Bogazici University Department of Computer Engineering 34342 Bebek, Istanbul,

More information

Markov Decision Processes and Solving Finite Problems. February 8, 2017

Markov Decision Processes and Solving Finite Problems. February 8, 2017 Markov Decision Processes and Solving Finite Problems February 8, 2017 Overview of Upcoming Lectures Feb 8: Markov decision processes, value iteration, policy iteration Feb 13: Policy gradients Feb 15:

More information

Decision-theoretic approaches to planning, coordination and communication in multiagent systems

Decision-theoretic approaches to planning, coordination and communication in multiagent systems Decision-theoretic approaches to planning, coordination and communication in multiagent systems Matthijs Spaan Frans Oliehoek 2 Stefan Witwicki 3 Delft University of Technology 2 U. of Liverpool & U. of

More information

Minimizing Communication Cost in a Distributed Bayesian Network Using a Decentralized MDP

Minimizing Communication Cost in a Distributed Bayesian Network Using a Decentralized MDP Minimizing Communication Cost in a Distributed Bayesian Network Using a Decentralized MDP Jiaying Shen Department of Computer Science University of Massachusetts Amherst, MA 0003-460, USA jyshen@cs.umass.edu

More information

A Method for Speeding Up Value Iteration in Partially Observable Markov Decision Processes

A Method for Speeding Up Value Iteration in Partially Observable Markov Decision Processes 696 A Method for Speeding Up Value Iteration in Partially Observable Markov Decision Processes Nevin L. Zhang, Stephen S. Lee, and Weihong Zhang Department of Computer Science, Hong Kong University of

More information

Control Theory : Course Summary

Control Theory : Course Summary Control Theory : Course Summary Author: Joshua Volkmann Abstract There are a wide range of problems which involve making decisions over time in the face of uncertainty. Control theory draws from the fields

More information

Machine Learning and Bayesian Inference. Unsupervised learning. Can we find regularity in data without the aid of labels?

Machine Learning and Bayesian Inference. Unsupervised learning. Can we find regularity in data without the aid of labels? Machine Learning and Bayesian Inference Dr Sean Holden Computer Laboratory, Room FC6 Telephone extension 6372 Email: sbh11@cl.cam.ac.uk www.cl.cam.ac.uk/ sbh11/ Unsupervised learning Can we find regularity

More information

Christopher Watkins and Peter Dayan. Noga Zaslavsky. The Hebrew University of Jerusalem Advanced Seminar in Deep Learning (67679) November 1, 2015

Christopher Watkins and Peter Dayan. Noga Zaslavsky. The Hebrew University of Jerusalem Advanced Seminar in Deep Learning (67679) November 1, 2015 Q-Learning Christopher Watkins and Peter Dayan Noga Zaslavsky The Hebrew University of Jerusalem Advanced Seminar in Deep Learning (67679) November 1, 2015 Noga Zaslavsky Q-Learning (Watkins & Dayan, 1992)

More information

Bayes-Adaptive POMDPs: Toward an Optimal Policy for Learning POMDPs with Parameter Uncertainty

Bayes-Adaptive POMDPs: Toward an Optimal Policy for Learning POMDPs with Parameter Uncertainty Bayes-Adaptive POMDPs: Toward an Optimal Policy for Learning POMDPs with Parameter Uncertainty Stéphane Ross School of Computer Science McGill University, Montreal (Qc), Canada, H3A 2A7 stephane.ross@mail.mcgill.ca

More information

Reinforcement Learning

Reinforcement Learning Reinforcement Learning Markov decision process & Dynamic programming Evaluative feedback, value function, Bellman equation, optimality, Markov property, Markov decision process, dynamic programming, value

More information

CS181 Midterm 2 Practice Solutions

CS181 Midterm 2 Practice Solutions CS181 Midterm 2 Practice Solutions 1. Convergence of -Means Consider Lloyd s algorithm for finding a -Means clustering of N data, i.e., minimizing the distortion measure objective function J({r n } N n=1,

More information

Federal Aviation Administration Optimal Aircraft Rerouting During Commercial Space Launches

Federal Aviation Administration Optimal Aircraft Rerouting During Commercial Space Launches Federal Aviation Administration Optimal Aircraft Rerouting During Commercial Space Launches Rachael Tompa Mykel Kochenderfer Stanford University Oct 28, 2015 1 Motivation! Problem: Launch vehicle anomaly

More information

Reinforcement Learning. Spring 2018 Defining MDPs, Planning

Reinforcement Learning. Spring 2018 Defining MDPs, Planning Reinforcement Learning Spring 2018 Defining MDPs, Planning understandability 0 Slide 10 time You are here Markov Process Where you will go depends only on where you are Markov Process: Information state

More information

Bayesian reinforcement learning and partially observable Markov decision processes November 6, / 24

Bayesian reinforcement learning and partially observable Markov decision processes November 6, / 24 and partially observable Markov decision processes Christos Dimitrakakis EPFL November 6, 2013 Bayesian reinforcement learning and partially observable Markov decision processes November 6, 2013 1 / 24

More information

Gestion de la production. Book: Linear Programming, Vasek Chvatal, McGill University, W.H. Freeman and Company, New York, USA

Gestion de la production. Book: Linear Programming, Vasek Chvatal, McGill University, W.H. Freeman and Company, New York, USA Gestion de la production Book: Linear Programming, Vasek Chvatal, McGill University, W.H. Freeman and Company, New York, USA 1 Contents 1 Integer Linear Programming 3 1.1 Definitions and notations......................................

More information

Partially Observable Markov Decision Processes (POMDPs) Pieter Abbeel UC Berkeley EECS

Partially Observable Markov Decision Processes (POMDPs) Pieter Abbeel UC Berkeley EECS Partially Observable Markov Decision Processes (POMDPs) Pieter Abbeel UC Berkeley EECS Many slides adapted from Jur van den Berg Outline POMDPs Separation Principle / Certainty Equivalence Locally Optimal

More information

CSEP 573: Artificial Intelligence

CSEP 573: Artificial Intelligence CSEP 573: Artificial Intelligence Hidden Markov Models Luke Zettlemoyer Many slides over the course adapted from either Dan Klein, Stuart Russell, Andrew Moore, Ali Farhadi, or Dan Weld 1 Outline Probabilistic

More information

Decomposition Methods for Large Scale LP Decoding

Decomposition Methods for Large Scale LP Decoding Decomposition Methods for Large Scale LP Decoding Siddharth Barman Joint work with Xishuo Liu, Stark Draper, and Ben Recht Outline Background and Problem Setup LP Decoding Formulation Optimization Framework

More information

INTRODUCTION TO MARKOV DECISION PROCESSES

INTRODUCTION TO MARKOV DECISION PROCESSES INTRODUCTION TO MARKOV DECISION PROCESSES Balázs Csanád Csáji Research Fellow, The University of Melbourne Signals & Systems Colloquium, 29 April 2010 Department of Electrical and Electronic Engineering,

More information

A Reinforcement Learning Algorithm with Polynomial Interaction Complexity for Only-Costly-Observable MDPs

A Reinforcement Learning Algorithm with Polynomial Interaction Complexity for Only-Costly-Observable MDPs A Reinforcement Learning Algorithm with Polynomial Interaction Complexity for Only-Costly-Observable MDPs Roy Fox Computer Science Department, Technion IIT, Israel Moshe Tennenholtz Faculty of Industrial

More information

CSE250A Fall 12: Discussion Week 9

CSE250A Fall 12: Discussion Week 9 CSE250A Fall 12: Discussion Week 9 Aditya Menon (akmenon@ucsd.edu) December 4, 2012 1 Schedule for today Recap of Markov Decision Processes. Examples: slot machines and maze traversal. Planning and learning.

More information

Decayed Markov Chain Monte Carlo for Interactive POMDPs

Decayed Markov Chain Monte Carlo for Interactive POMDPs Decayed Markov Chain Monte Carlo for Interactive POMDPs Yanlin Han Piotr Gmytrasiewicz Department of Computer Science University of Illinois at Chicago Chicago, IL 60607 {yhan37,piotr}@uic.edu Abstract

More information

Dynamic Power Management under Uncertain Information. University of Southern California Los Angeles CA

Dynamic Power Management under Uncertain Information. University of Southern California Los Angeles CA Dynamic Power Management under Uncertain Information Hwisung Jung and Massoud Pedram University of Southern California Los Angeles CA Agenda Introduction Background Stochastic Decision-Making Framework

More information

Reinforcement Learning: An Introduction

Reinforcement Learning: An Introduction Introduction Betreuer: Freek Stulp Hauptseminar Intelligente Autonome Systeme (WiSe 04/05) Forschungs- und Lehreinheit Informatik IX Technische Universität München November 24, 2004 Introduction What is

More information

Planning Under Uncertainty II

Planning Under Uncertainty II Planning Under Uncertainty II Intelligent Robotics 2014/15 Bruno Lacerda Announcement No class next Monday - 17/11/2014 2 Previous Lecture Approach to cope with uncertainty on outcome of actions Markov

More information

On Prediction and Planning in Partially Observable Markov Decision Processes with Large Observation Sets

On Prediction and Planning in Partially Observable Markov Decision Processes with Large Observation Sets On Prediction and Planning in Partially Observable Markov Decision Processes with Large Observation Sets Pablo Samuel Castro pcastr@cs.mcgill.ca McGill University Joint work with: Doina Precup and Prakash

More information

Reinforcement Learning and Control

Reinforcement Learning and Control CS9 Lecture notes Andrew Ng Part XIII Reinforcement Learning and Control We now begin our study of reinforcement learning and adaptive control. In supervised learning, we saw algorithms that tried to make

More information

Relational Partially Observable MDPs

Relational Partially Observable MDPs Proceedings of the Twenty-Fourth AAAI Conference on Artificial Intelligence (AAAI-10) elational Partially Observable MDPs Chenggang Wang and oni Khardon Department of Computer Science Tufts University

More information

MDP Preliminaries. Nan Jiang. February 10, 2019

MDP Preliminaries. Nan Jiang. February 10, 2019 MDP Preliminaries Nan Jiang February 10, 2019 1 Markov Decision Processes In reinforcement learning, the interactions between the agent and the environment are often described by a Markov Decision Process

More information

Dialogue management: Parametric approaches to policy optimisation. Dialogue Systems Group, Cambridge University Engineering Department

Dialogue management: Parametric approaches to policy optimisation. Dialogue Systems Group, Cambridge University Engineering Department Dialogue management: Parametric approaches to policy optimisation Milica Gašić Dialogue Systems Group, Cambridge University Engineering Department 1 / 30 Dialogue optimisation as a reinforcement learning

More information

Localization of Radioactive Sources Zhifei Zhang

Localization of Radioactive Sources Zhifei Zhang Localization of Radioactive Sources Zhifei Zhang 4/13/2016 1 Outline Background and motivation Our goal and scenario Preliminary knowledge Related work Our approach and results 4/13/2016 2 Background and

More information

The simplex algorithm

The simplex algorithm The simplex algorithm The simplex algorithm is the classical method for solving linear programs. Its running time is not polynomial in the worst case. It does yield insight into linear programs, however,

More information

CS188: Artificial Intelligence, Fall 2009 Written 2: MDPs, RL, and Probability

CS188: Artificial Intelligence, Fall 2009 Written 2: MDPs, RL, and Probability CS188: Artificial Intelligence, Fall 2009 Written 2: MDPs, RL, and Probability Due: Thursday 10/15 in 283 Soda Drop Box by 11:59pm (no slip days) Policy: Can be solved in groups (acknowledge collaborators)

More information

Final. Introduction to Artificial Intelligence. CS 188 Spring You have approximately 2 hours and 50 minutes.

Final. Introduction to Artificial Intelligence. CS 188 Spring You have approximately 2 hours and 50 minutes. CS 188 Spring 2014 Introduction to Artificial Intelligence Final You have approximately 2 hours and 50 minutes. The exam is closed book, closed notes except your two-page crib sheet. Mark your answers

More information

Probabilistic Planning. George Konidaris

Probabilistic Planning. George Konidaris Probabilistic Planning George Konidaris gdk@cs.brown.edu Fall 2017 The Planning Problem Finding a sequence of actions to achieve some goal. Plans It s great when a plan just works but the world doesn t

More information

Computing Optimal Policies for Partially Observable Decision Processes using Compact Representations

Computing Optimal Policies for Partially Observable Decision Processes using Compact Representations Computing Optimal Policies for Partially Observable Decision Processes using Compact epresentations Craig Boutilier and David Poole Department of Computer Science niversity of British Columbia Vancouver,

More information

15-780: ReinforcementLearning

15-780: ReinforcementLearning 15-780: ReinforcementLearning J. Zico Kolter March 2, 2016 1 Outline Challenge of RL Model-based methods Model-free methods Exploration and exploitation 2 Outline Challenge of RL Model-based methods Model-free

More information

Exploiting Structure to Efficiently Solve Large Scale Partially Observable Markov Decision Processes. Pascal Poupart

Exploiting Structure to Efficiently Solve Large Scale Partially Observable Markov Decision Processes. Pascal Poupart Exploiting Structure to Efficiently Solve Large Scale Partially Observable Markov Decision Processes by Pascal Poupart A thesis submitted in conformity with the requirements for the degree of Doctor of

More information

Optimal Control of Partiality Observable Markov. Processes over a Finite Horizon

Optimal Control of Partiality Observable Markov. Processes over a Finite Horizon Optimal Control of Partiality Observable Markov Processes over a Finite Horizon Report by Jalal Arabneydi 04/11/2012 Taken from Control of Partiality Observable Markov Processes over a finite Horizon by

More information

arxiv: v2 [cs.gt] 4 Aug 2016

arxiv: v2 [cs.gt] 4 Aug 2016 Dynamic Programming for One-Sided Partially Observable Pursuit-Evasion Games Karel Horák, Branislav Bošanský {horak,bosansky}@agents.fel.cvut.cz arxiv:1606.06271v2 [cs.gt] 4 Aug 2016 Department of Computer

More information

Information Gathering and Reward Exploitation of Subgoals for P

Information Gathering and Reward Exploitation of Subgoals for P Information Gathering and Reward Exploitation of Subgoals for POMDPs Hang Ma and Joelle Pineau McGill University AAAI January 27, 2015 http://www.cs.washington.edu/ai/mobile_robotics/mcl/animations/global-floor.gif

More information

A Decentralized Approach to Multi-agent Planning in the Presence of Constraints and Uncertainty

A Decentralized Approach to Multi-agent Planning in the Presence of Constraints and Uncertainty 2011 IEEE International Conference on Robotics and Automation Shanghai International Conference Center May 9-13, 2011, Shanghai, China A Decentralized Approach to Multi-agent Planning in the Presence of

More information

Lecture 3: Markov Decision Processes

Lecture 3: Markov Decision Processes Lecture 3: Markov Decision Processes Joseph Modayil 1 Markov Processes 2 Markov Reward Processes 3 Markov Decision Processes 4 Extensions to MDPs Markov Processes Introduction Introduction to MDPs Markov

More information