RL 14: POMDPs continued


Michael Herrmann, University of Edinburgh, School of Informatics, 06/03/2015

POMDPs: Points to remember
- Belief states are probability distributions over states.
- Even if computationally complex, POMDPs can be useful as a modelling approach (consider simplifying the implementation in a second stage).
- POMDPs enable agents to deal with uncertainty efficiently.
- POMDPs are Markov w.r.t. belief states.
- Beliefs tend to blur as a consequence of the state dynamics, but can refocus by incorporating observations via Bayes' rule.
- Policy trees take all possible realisations of the sequence of future observations into account, i.e. the choice of the current action depends on an average over many futures. This causes exponential complexity unless the time horizon is truncated (standard) or approximations are used, e.g. QMDP (see the sketch below), AMDP, and sample-based methods.
- Often some states are fully observable, and these may be the states where decisions are critical (e.g. a robot turning when observing a doorway).
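
As an illustration of the QMDP approximation mentioned above, here is a minimal sketch: solve the underlying fully observable MDP, then act greedily on the belief-weighted Q-values. The array layouts (T[a, s, s'], R[s, a]) and function names are assumptions of this sketch, not part of the lecture.

```python
import numpy as np

def qmdp_q_values(T, R, gamma, n_iter=200):
    """Value iteration for the underlying (fully observable) MDP.

    T[a, s, s2] : transition probabilities P(s2 | s, a)  (assumed layout)
    R[s, a]     : expected immediate reward              (assumed layout)
    Returns Q[s, a]."""
    n_states, n_actions = R.shape
    Q = np.zeros((n_states, n_actions))
    for _ in range(n_iter):
        V = Q.max(axis=1)                              # V(s) = max_a Q(s, a)
        Q = R + gamma * np.einsum('asn,n->sa', T, V)   # one Bellman backup
    return Q

def qmdp_action(b, Q):
    """QMDP rule: act as if the state became observable after one step,
    i.e. choose argmax_a sum_s b(s) Q(s, a)."""
    return int(np.argmax(b @ Q))
```

The approximation ignores the value of information gathering, which is why it can fail exactly in the information-gathering situations discussed later.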

Markov models A. Cassandra: http://www.pomdp.org/faq.shtml

Solving small information-gathering problems

POMDP: Applications
- Belief-MDP (Åström, 1965)
- Autonomous robot localisation and navigation
- Classical RL applications (test problems): elevator control, machine maintenance, structural inspection
- Business (IT): marketing, troubleshooting, dialogue systems, distributed database queries, routing
- Operations research: medicine, finance, fishery industry, conservation, education

Anthony Cassandra. A Survey of POMDP Applications. Presented at the AAAI Fall Symposium, 1998.
Young, S., Gasic, M., Thomson, B., & Williams, J. D. (2013). POMDP-based statistical spoken dialog systems: A review. Proceedings of the IEEE, 101(5), 1160-1179.

Belief spaces
- In terms of belief states, POMDPs are MDPs: the previous methods are applicable.
- Belief space: a line segment for 2 states, ..., a simplex for N states.
- The value function over belief states is piecewise linear and convex (Sondik, 1978).
- Represent it by points: point-based algorithms. Represent it by vectors: α-vector based algorithms (a minimal evaluation sketch follows).
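
Because the value function is piecewise linear and convex, it can be stored as a finite set of α-vectors, and evaluating it at a belief is just a maximum over dot products. A minimal sketch (the example numbers are purely illustrative):

```python
import numpy as np

def value_at(b, alphas):
    """V(b) = max_k alpha_k . b  for a set of alpha-vectors (rows of `alphas`)."""
    return float(np.max(alphas @ b))

def greedy_alpha(b, alphas):
    """Index of the alpha-vector that attains the maximum at belief b."""
    return int(np.argmax(alphas @ b))

# Example: two states, three alpha-vectors
alphas = np.array([[1.0, 0.0],
                   [0.0, 1.0],
                   [0.6, 0.6]])
b = np.array([0.3, 0.7])
print(value_at(b, alphas))   # 0.7
```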

Value iteration

$$
\begin{aligned}
V_t(b) &= \max_{a\in A} \sum_{s\in S} b(s) \sum_{s'\in S}\sum_{o\in\Omega} T(s'\mid s,a)\,\Omega(o\mid s',a)\,\bigl(R^a_{ss'} + \gamma\, V_{t-1}(b^o_a)\bigr)\\
&= \max_{a\in A} \Bigl(\sum_{s\in S} b(s)\,R(s,a) + \gamma \sum_{s\in S} b(s) \sum_{o\in\Omega}\sum_{s'\in S} T(s'\mid s,a)\,\Omega(o\mid s',a)\,V_{t-1}(b^o_a)\Bigr)\\
&= \max_{a\in A} \Bigl(\sum_{s\in S} b(s)\,R(s,a) + \gamma \sum_{o\in\Omega}\max_k \sum_{s\in S} b(s) \sum_{s'\in S} T(s'\mid s,a)\,\Omega(o\mid s',a)\,\alpha^k_{t-1}(s')\Bigr)\\
&= \max_{a\in A} \sum_{s\in S} b(s) \underbrace{\Bigl(R(s,a) + \gamma \sum_{o\in\Omega}\sum_{s'\in S} T(s'\mid s,a)\,\Omega(o\mid s',a)\,\alpha^{l(b,a,o)}_{t-1}(s')\Bigr)}_{\alpha^k_t(s)}
\end{aligned}
$$

where $R(s,a) = \sum_{s'} T(s'\mid s,a)\,R^a_{ss'}$ and $l(b,a,o)$ is the index of the α-vector that attains the maximum for observation $o$.

Algorithm POMDP(T)  (based on a set of points x_1, ..., x_N)
  Υ = {(0, ..., 0)}
  for τ = 1 to T do
    Υ' = ∅
    for all α-vectors (v^k_1, ..., v^k_N) in Υ do
      for all control actions u do
        for all measurements z do
          for j = 1 to N do
            v^k_{j,u,z} = Σ_{i=1}^N v^k_i p(z | x_i) p(x_i | u, x_j)
          endfor
        endfor
      endfor
    endfor
    for all control actions u do
      for k = 1 to |Υ| do
        for i = 1 to N do
          v'_i = r(x_i, u) + γ Σ_z v^k_{i,u,z}
        endfor
        add the pair (u; v'_1, ..., v'_N) to Υ'
      endfor
    endfor
    optional: prune Υ'
    Υ = Υ'
  endfor
  return Υ
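
A direct numpy transcription of one sweep of this pseudocode could look as follows. The array layouts T[u, j, i] = p(x_i | u, x_j), Z[i, z] = p(z | x_i) and R[i, u] = r(x_i, u) are assumptions of this sketch, and pruning is omitted.

```python
import numpy as np

def pomdp_backup(Upsilon, T, Z, R, gamma):
    """One iteration (one value of tau) of the pseudocode above.

    Upsilon    : array of shape (K, N), rows are the current alpha-vectors
    T[u, j, i] : p(x_i | u, x_j)   (transition model, assumed layout)
    Z[i, z]    : p(z | x_i)        (observation model, assumed layout)
    R[i, u]    : r(x_i, u)
    Returns the new set Upsilon' as a list of (action, alpha-vector) pairs."""
    K, N = Upsilon.shape
    n_actions = T.shape[0]
    new_set = []
    for k in range(K):
        for u in range(n_actions):
            # v^k_{j,u,z} = sum_i v^k_i p(z | x_i) p(x_i | u, x_j)
            v_kuz = np.einsum('i,iz,ji->zj', Upsilon[k], Z, T[u])
            # v'_i = r(x_i, u) + gamma * sum_z v^k_{i,u,z}
            v_new = R[:, u] + gamma * v_kuz.sum(axis=0)
            new_set.append((u, v_new))
    return new_set
```

Repeating this sweep T times, starting from the all-zero vector, and pruning dominated vectors after each sweep reproduces the loop structure of the pseudocode.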

Remarks on the algorithm
- Without pruning, |Υ| increases exponentially with T.
- The algorithm only determines the value function; it is pure value iteration, so actual observations and actions do not enter.
- Further steps in the algorithm:
  - Find the value function on policy trees up to a given T.
  - Determine the maximum over branches and perform the first action.
  - Recalculate the policy, taking observations and rewards into account.
  - Update the observation model, transition model and reward model.
- Many variants exist.

A systematic approach to POMDPs
In a POMDP the agent has no information about states, only about observation outcomes and performed actions. In the following we start with a trivial expression of this fact and use it to motivate belief states.

Finite Horizon Problems
COMDP (completely observable MDP) with horizon T: the optimal policy is a sequence of mappings π_t : S → A, t = 0, ..., T-1,

$$\pi^* = (\pi^*_t)_{t=0,\ldots,T-1} = \arg\max_{\pi_0,\ldots,\pi_{T-1}} \mathbb{E}\left[ r(s_T) + \sum_{t=0}^{T-1} r\bigl(s_t, \pi_t(s_t)\bigr) \right]$$

Previously we assumed that all t share the same mapping.
POMDP: the optimal policy π* depends on past actions and observations. Policies are trees, π : (A × O)* → A. Optimal policy trees can be calculated by value iteration on the branches of policy trees; branches correspond to sequences of actions and observations. (A small policy-tree data structure is sketched below.)
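
A policy tree can be represented directly as a node holding an action plus one subtree per possible observation. This is only an illustrative sketch; the `env.step(action) -> observation` interface is a hypothetical stand-in for whatever simulator or robot is used.

```python
from dataclasses import dataclass, field
from typing import Dict

@dataclass
class PolicyTreeNode:
    """A finite-horizon POMDP policy as a tree: take `action`, then branch
    on the observation received and continue with the matching subtree."""
    action: int
    children: Dict[int, "PolicyTreeNode"] = field(default_factory=dict)

def execute(node: PolicyTreeNode, env, horizon: int) -> None:
    """Run a policy tree for `horizon` steps; `env.step` is a hypothetical
    interface returning the observation after an action."""
    for _ in range(horizon):
        obs = env.step(node.action)
        if obs not in node.children:
            break
        node = node.children[obs]
```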

Information states
In principle, all previous information is needed to decide about values and actions. This is realised by information states I ∈ (A × O)*, e.g. I_t = (a_0, o_0, ..., a_t, o_t). Then it holds (trivially) that

$$P\bigl(I_{t+1} = (a_0, o_0, \ldots, a_{t+1}, o_{t+1}) \mid a_{t+1}, I_t\bigr) = P(o_{t+1} \mid a_{t+1}, I_t)$$

A POMDP is an information-state MDP (with a huge state space). Rewards:

$$r_I(I_t, a) = \sum_{s \in S} P(s_t = s \mid I_t)\, r(s, a)$$

Value iteration on information states
Initialisation:

$$V_T(I_T) = \sum_{s \in S} P(s_T = s \mid I_T)\, r(s)$$

Bellman optimality equation:

$$V(I_t) = \max_{a \in A} \Bigl( \sum_{s \in S} P(s_t = s \mid I_t)\, r(s, a) + \gamma \sum_{o \in O} P(o \mid I_t, a)\, V\bigl(I_{t+1} = (a_0, o_0, \ldots, a_{t+1}{=}a, o_{t+1}{=}o)\bigr) \Bigr)$$

- The state space grows exponentially with T (the episode length); see the small enumeration sketch below.
- Belief states summarise information states by distributions over states.
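
A tiny sketch of why working on information states directly is impractical, and of the reward r_I from the previous slide. The helper `state_posterior(I_t)`, returning {s: P(s_t = s | I_t)}, is assumed given here; it is exactly the belief computed on the following slides.

```python
from itertools import product

def information_states(actions, observations, t):
    """Enumerate all information states I_t = (a_0, o_0, ..., a_t, o_t).
    Their number is (|A| * |O|)**(t + 1), i.e. exponential in t."""
    steps = list(product(actions, observations))
    return list(product(steps, repeat=t + 1))

def reward_I(I_t, a, state_posterior, r):
    """r_I(I_t, a) = sum_s P(s_t = s | I_t) r(s, a); `state_posterior` and
    `r(s, a)` are assumed helpers for this sketch."""
    return sum(p * r(s, a) for s, p in state_posterior(I_t).items())

# e.g. 2 actions and 2 observations give (2*2)**3 = 64 information states at t = 2
print(len(information_states([0, 1], ['o1', 'o2'], 2)))
```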

Equivalence of Information States and Belief States
For an underlying Markov process, information states and belief states are equivalent in the sense that they lead to the same optimal value functions.
- Belief states summarise information states by distributions over states.
- Optimal value functions are defined by the Bellman optimality equation.
- Equivalence in terms of the value functions follows from the compatibility of the belief update with the Bellman equation.

Belief update

$$
\begin{aligned}
b'(s') = P(s' \mid z, a, b) &= \frac{P(z \mid s', a, b)\, P(s' \mid a, b)}{P(z \mid a, b)}\\
&= \frac{P(z \mid s', a) \sum_{s \in S} P(s' \mid a, b, s)\, P(s \mid a, b)}{P(z \mid a, b)}\\
&= \frac{O(z, s', a) \sum_{s \in S} T(s, a, s')\, b(s)}{P(z \mid a, b)}
\end{aligned}
$$

z: observation, a: action, s: state, b: belief (distribution over states); O: observation model, T: state transition probability.

Rewards on belief states:

$$\rho(b, a) = \sum_{s \in S} b(s)\, R(s, a)$$
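
The belief update is a one-line Bayes filter. A minimal sketch, assuming the array layouts T[a, s, s'] = P(s' | s, a) and O[a, s', z] = P(z | s', a):

```python
import numpy as np

def belief_update(b, a, z, T, O):
    """Bayes update of the belief after taking action a and observing z.

    b           : current belief over states, shape (S,)
    T[a, s, s2] : P(s2 | s, a)   (assumed layout)
    O[a, s2, z] : P(z | s2, a)   (assumed layout)
    """
    # predicted next-state distribution: sum_s P(s' | s, a) b(s)
    predicted = b @ T[a]                    # shape (S,)
    # weight by the likelihood of the observation and renormalise
    unnormalised = O[a, :, z] * predicted
    norm = unnormalised.sum()               # this is P(z | a, b)
    if norm == 0.0:
        raise ValueError("observation has zero probability under this belief")
    return unnormalised / norm
```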

Value iteration on belief states
Initialisation:

$$V_T(b) = \sum_{s \in S} b(s)\, r(s)$$

Bellman equation:

$$V(b) = \max_{a \in A} \Bigl( \sum_{s \in S} b(s)\, r(s, a) + \gamma \sum_{o \in O} \sum_{s, s' \in S} P(o \mid s', a)\, P(s' \mid s, a)\, b(s)\, V\bigl(b'(o, a)\bigr) \Bigr)$$
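
A sketch of one Bellman backup at a single belief b, following the equation above; V can be any evaluator of successor beliefs, e.g. `lambda bp: float(np.max(alphas @ bp))` with the α-vectors from earlier. The array layouts T[a, s, s'], O[a, s', z] and R[s, a] are the same assumptions as before.

```python
import numpy as np

def bellman_backup(b, V, T, O, R, gamma):
    """One backup on belief b: returns (best value, best action).

    V : callable giving the value of a successor belief."""
    n_actions = T.shape[0]
    n_obs = O.shape[2]
    best_value, best_action = -np.inf, None
    for a in range(n_actions):
        value = float(b @ R[:, a])
        predicted = b @ T[a]                        # P(s' | b, a)
        for z in range(n_obs):
            p_z = float(O[a, :, z] @ predicted)     # P(z | b, a)
            if p_z > 0.0:
                b_next = O[a, :, z] * predicted / p_z
                value += gamma * p_z * V(b_next)
        if value > best_value:
            best_value, best_action = value, a
    return best_value, best_action
```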

Policy iteration
Backup in belief space:

$$V(b) = \max_u \Bigl( r(b, u) + \gamma \int V(b')\, p(b' \mid u, b)\, db' \Bigr)$$

$$p(b' \mid u, b) = \int p(b' \mid u, b, z)\, p(z \mid u, b)\, dz$$

$$V(b) = \max_u \Bigl( r(b, u) + \gamma \int \Bigl[ \int V(b')\, p(b' \mid u, b, z)\, db' \Bigr] p(z \mid u, b)\, dz \Bigr)$$
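
The integral over successor beliefs can be approximated by sampling observations, since b' is a deterministic function of (b, u, z). A Monte Carlo sketch under that assumption; `sample_z`, `belief_update` and `r` are assumed helpers (the belief update is the one from the earlier slide):

```python
import numpy as np

def mc_backup(b, actions, V, sample_z, belief_update, r, gamma, n_samples=100):
    """Monte Carlo version of the backup above: the expectation over
    observations z ~ p(z | u, b) is replaced by a sample average.

    sample_z(b, u)         : draws an observation from p(z | u, b)
    belief_update(b, u, z) : returns the successor belief b'
    r(b, u)                : expected immediate reward on the belief
    V(b')                  : value of a successor belief
    """
    best_value, best_action = -np.inf, None
    for u in actions:
        future = np.mean([V(belief_update(b, u, sample_z(b, u)))
                          for _ in range(n_samples)])
        value = r(b, u) + gamma * future
        if value > best_value:
            best_value, best_action = value, u
    return best_value, best_action
```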

Recent and current research
- Solution of gridworld POMDPs (M. Hauskrecht, 2000)
- Point-based value iteration (J. Pineau, 2003)
- Large problems: Heuristic Search Value Iteration (T. Smith & R. Simmons, 2004): 12545 states, using bounds on the value function over belief states
- Learning POMDPs from data (learning a model of the dynamics): compressed predictive state representations
- Bayes-adaptive POMDPs (tracking the dynamics of belief states)
- Policy search, hierarchical POMDPs, decentralised POMDPs, ...

Joelle Pineau (2013) A POMDP Tutorial. European Workshop on Reinforcement Learning.

Summary on POMDPs
- POMDPs compute the optimal action in partially observable, stochastic domains.
- For finite-horizon problems, the resulting value functions are piecewise linear and convex, but very complicated.
- A number of heuristic and stochastic approaches are available to reduce the complexity.
- Combinations with other RL approaches are possible.
- POMDPs have been applied successfully to realistic problems in robotics.

References
Thrun, S., Burgard, W., & Fox, D. (2005). Probabilistic Robotics. MIT Press. Chapters 15 and 16. (textbook)
Hauskrecht, M. (2000). Value-function approximations for partially observable Markov decision processes. Journal of Artificial Intelligence Research, 13, 33-94. (detailed paper)
Pineau, J. (2013). A POMDP Tutorial. European Workshop on Reinforcement Learning. (review of recent research)