Bayesian reinforcement learning and partially observable Markov decision processes. Christos Dimitrakakis, EPFL. November 6, 2013.

1. Introduction. 2. Bayesian reinforcement learning: the expected utility; stochastic branch and bound. 3. Partially observable Markov decision processes.

Introduction. Summary of previous developments: probability and utility; making decisions under uncertainty; updating probabilities; optimal experiment design; Markov decision processes; stochastic algorithms for Markov decision processes; MDP approximations; Bayesian reinforcement learning.

The reinforcement learning problem. Learning to act in an unknown environment, by interaction and reinforcement.

Markov decision processes (MDP). We are in some environment µ, where at each time step t we observe a state $s_t \in S$, take an action $a_t \in A$, and receive a reward $r_t \in R$.

The optimal policy for a given µ. Find a policy $\pi : S \to A$ maximising the utility $U = \sum_t r_t$ in expectation. When µ is known, standard algorithms such as value or policy iteration apply. However, this is contrary to the problem definition: µ is unknown.

Bayesian RL: use a subjective belief ξ(µ) over environments and maximise the ξ-expected utility

$U^*_\xi = \max_\pi E(U \mid \pi, \xi) = \max_\pi \sum_\mu E(U \mid \pi, \mu)\, \xi(\mu).$

This is not actually easy, as π must now map complete histories to actions: planning must take future learning into account.

Updating the belief. Example: when the number of MDPs is finite. Exercise: another practical scenario is when we have an independent belief over the transition probabilities of each state-action pair. Consider the case of n states and k actions. Similarly to the product prior in the bandit exercise of exercise set 4, we assign a probability (density) $\xi_{s,a}$ to each probability vector $\theta^{(s,a)} \in \mathbb{S}^n$ (the n-simplex). We can then define our joint belief on the $(nk) \times n$ matrix $\Theta$ to be $\xi(\Theta) = \prod_{s \in S, a \in A} \xi_{s,a}(\theta^{(s,a)})$. Derive the updates for a product-Dirichlet prior on transitions and a product-Normal-Gamma prior on rewards. What would be the meaning of using a Normal-Wishart prior on rewards?
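As a concrete illustration of the product-Dirichlet update, here is a minimal sketch (assuming a tabular MDP; the class and method names are illustrative, not from the slides): each observed transition (s, a, s') simply increments the corresponding Dirichlet pseudo-count, and the posterior mean gives an expected transition matrix.

```python
import numpy as np

class DirichletTransitionBelief:
    """Independent Dirichlet belief over P(. | s, a) for each state-action pair."""

    def __init__(self, n_states, n_actions, prior_count=0.5):
        # alpha[s, a, s'] are Dirichlet pseudo-counts (prior_count is a hyperparameter choice).
        self.alpha = np.full((n_states, n_actions, n_states), float(prior_count))

    def update(self, s, a, s_next):
        # Conjugate update: observing s -> s' under action a adds one to the matching count.
        self.alpha[s, a, s_next] += 1.0

    def expected_transitions(self):
        # Posterior mean transition matrix, i.e. E_xi[theta^(s,a)] for every (s, a).
        return self.alpha / self.alpha.sum(axis=-1, keepdims=True)

    def sample_mdp_transitions(self, rng):
        # One transition matrix drawn from the posterior (used for sampling-based planning below).
        return np.array([[rng.dirichlet(self.alpha[s, a])
                          for a in range(self.alpha.shape[1])]
                         for s in range(self.alpha.shape[0])])
```

The same conjugate pattern applies to a product-Normal-Gamma prior on rewards, updating mean and precision sufficient statistics per state-action pair.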

The expected MDP heuristic. 1. For a given belief ξ, calculate the expected MDP $\mu_\xi \triangleq E_\xi\, \mu$. 2. Calculate the optimal memoryless policy for $\mu_\xi$: $\pi^*(\mu_\xi) \in \arg\max_{\pi \in \Pi_1} V^\pi_{\mu_\xi}$, where $\Pi_1 = \{\pi \in \Pi \mid P_\pi(a_t \mid s_t, a_{t-1}) = P_\pi(a_t \mid s_t)\}$ is the set of memoryless policies. 3. Execute $\pi^*(\mu_\xi)$. Problem: unfortunately, the resulting policy may be far from the optimal policy in $\Pi_1$.
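A minimal sketch of this heuristic, assuming the hypothetical Dirichlet belief class above and a known (or posterior-mean) reward table R; names are illustrative:

```python
import numpy as np

def value_iteration(P, R, gamma=0.95, tol=1e-6):
    """Solve a known MDP. P: (S, A, S) transition array, R: (S, A) expected rewards."""
    n_states, n_actions, _ = P.shape
    V = np.zeros(n_states)
    while True:
        Q = R + gamma * np.einsum("sat,t->sa", P, V)   # Q(s,a) = R(s,a) + gamma * E[V(s')]
        V_new = Q.max(axis=1)
        if np.max(np.abs(V_new - V)) < tol:
            return Q.argmax(axis=1), V_new              # greedy memoryless policy and its value
        V = V_new

def expected_mdp_policy(belief, R, gamma=0.95):
    """The 'expected MDP' heuristic: plan in the posterior-mean MDP mu_xi = E_xi[mu]."""
    P_mean = belief.expected_transitions()
    policy, _ = value_iteration(P_mean, R, gamma)
    return policy
```

As the counterexample on the next slide shows, planning in $\mu_\xi$ can be far from the best memoryless policy for the belief itself.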

Counterexample¹. [Figure: three deterministic-transition MDP diagrams with edge rewards 0, ε and 1: (a) µ₁, (b) µ₂, (c) µ_ξ.] Here $M = \{\mu_1, \mu_2\}$, $\xi(\mu_1) = \theta$, $\xi(\mu_2) = 1 - \theta$, deterministic transitions. For $T \to \infty$, the $\mu_\xi$-optimal policy is not optimal in $\Pi_1$ if

$\epsilon < \frac{\gamma\theta(1-\theta)}{1-\gamma}\left(\frac{1}{1-\gamma\theta} + \frac{1}{1-\gamma(1-\theta)}\right).$

In this example, $\mu_\xi \notin M$. For smooth beliefs, $\mu_\xi$ is close to $\hat\mu_\xi$. ¹ Based on an example by Remi Munos.

Counterexample for $\hat\mu_\xi \triangleq \arg\max_\mu \xi(\mu)$. MDP set $M = \{\mu_i \mid i = 1, \ldots, n\}$ with $A = \{0, \ldots, n\}$. In all MDPs, action $a = 0$ gives a reward of ε and the MDP terminates. In the i-th MDP, all other actions give a reward of 0, apart from the i-th action, which gives a reward of 1. [Figure: the MDP $\mu_i$, with $a = 0$ giving $r = \epsilon$, $a = i$ giving $r = 1$, and every other action giving $r = 0$.] The ξ-optimal policy takes action $i^* = \arg\max_i \xi(\mu_i)$ iff $\xi(\mu_{i^*}) \geq \epsilon$, and otherwise takes action 0. The $\hat\mu_\xi$-optimal policy always takes $a = \arg\max_i \xi(\mu_i)$.
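A few lines make the gap concrete (a sketch with hypothetical numbers, not from the slides): when the posterior is spread over many MDPs, the MAP-MDP action has small expected reward under the belief, while the safe action 0 is Bayes-optimal.

```python
import numpy as np

def xi_optimal_action(xi, eps):
    """Bayes-optimal action in the counterexample: expected reward of action i is xi[i-1]."""
    i_best = int(np.argmax(xi))
    return i_best + 1 if xi[i_best] >= eps else 0   # actions are 1..n; 0 is the safe action

def map_mdp_action(xi):
    """Action chosen by planning in the most probable MDP: always bets on argmax_i xi[i]."""
    return int(np.argmax(xi)) + 1

xi = np.full(20, 1 / 20)           # flat posterior over 20 MDPs
eps = 0.1
print(xi_optimal_action(xi, eps))  # 0: safe action, expected reward 0.1 > 0.05
print(map_mdp_action(xi))          # 1: expected reward only 0.05 under the belief
```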

The expected utility: policy evaluation. The expected utility of a policy π for a belief ξ is

$V^\pi_\xi \triangleq E(U \mid \xi, \pi)$  (2.1)
$= \int_M E(U \mid \mu, \pi)\, d\xi(\mu)$  (2.2)
$= \int_M V^\pi_\mu\, d\xi(\mu)$  (2.3)

Bayesian Monte-Carlo policy evaluation: input policy π and belief ξ. For $k = 1, \ldots, K$: sample $\mu_k \sim \xi$ and set $v_k = V^\pi_{\mu_k}$. Return $u = \frac{1}{K}\sum_{k=1}^K v_k$.

The expected utility: upper bounds on the utility for a belief ξ.

$V^*_\xi \triangleq \sup_\pi E(U \mid \xi, \pi) = \sup_\pi \int_M E(U \mid \mu, \pi)\, d\xi(\mu)$  (2.4)
$\leq \int_M \sup_\pi V^\pi_\mu\, d\xi(\mu) = \int_M V^*_\mu\, d\xi(\mu) \triangleq V^+_\xi$  (2.5)

Bayesian Monte-Carlo upper bound: input belief ξ. For $k = 1, \ldots, K$: sample $\mu_k \sim \xi$ and set $v_k = V^*_{\mu_k}$. Return $u = \frac{1}{K}\sum_{k=1}^K v_k$.
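The two Monte-Carlo estimators above are almost identical in code. A minimal sketch, reusing the hypothetical `value_iteration` and belief class from the earlier snippets; the lower bound evaluates a fixed policy exactly by solving the linear Bellman system.

```python
import numpy as np

def policy_value(P, R, policy, gamma=0.95):
    """Exact V^pi for a known MDP by solving (I - gamma * P_pi) V = R_pi."""
    n_states = P.shape[0]
    P_pi = P[np.arange(n_states), policy]          # (S, S) transitions under pi
    R_pi = R[np.arange(n_states), policy]          # (S,) rewards under pi
    return np.linalg.solve(np.eye(n_states) - gamma * P_pi, R_pi)

def mc_bounds(belief, R, policy, gamma=0.95, n_samples=100, seed=0):
    """Monte-Carlo lower bound V^pi_xi and upper bound V^+_xi, averaged over sampled MDPs."""
    rng = np.random.default_rng(seed)
    lower, upper = [], []
    for _ in range(n_samples):
        P_k = belief.sample_mdp_transitions(rng)             # mu_k ~ xi
        lower.append(policy_value(P_k, R, policy, gamma))    # V^pi_{mu_k}
        _, V_star = value_iteration(P_k, R, gamma)           # V^*_{mu_k}
        upper.append(V_star)
    return np.mean(lower, axis=0), np.mean(upper, axis=0)
```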

The expected utility: bounds on $V^*_\xi = \max_\pi E(U \mid \pi, \xi)$. [Figure: a geometric view of the bounds, plotting value as a function of the belief ξ: the MDP-optimal values $V^*_{\mu_1}$ and $V^*_{\mu_2}$, their expectation $E(V^*_\mu \mid \xi)$ and weighted combinations $\sum_i w_i V^*_{\xi_i}$ as upper bounds; the Bayes-optimal value $V^*_\xi$; and the values of fixed policies $\pi_1$, $\pi_2$ and $\pi^*(\xi_1)$ as lower bounds.]

The expected utility: better lower bounds [?]. [Figure: expected utility over all states as the belief ξ ranges from uncertain to certain, comparing the new bound, the naive bound and the upper bound.] Main idea: maximise within memoryless policies, so that we can assume a fixed belief, and perform backwards induction jointly over n MDPs. This improves the naive lower bound.

The expected utility: multi-MDP backwards induction.

$Q^\pi_{\xi,t}(s,a) \triangleq \int_M \left\{ R_\mu(s,a) + \gamma \int_S V^\pi_{\mu,t+1}(s')\, dT^{s,a}_\mu(s') \right\} d\xi(\mu)$  (2.6)

Multi-MDP Backwards Induction, MMBI(M, ξ, γ, T):
Set $V_{\mu,T+1}(s) = 0$ for all $s \in S$ and $\mu \in M$.
For $t = T, T-1, \ldots, 0$: for each $s \in S$, $a \in A$, calculate $Q_{\xi,t}(s,a)$ from (2.6) using $\{V_{\mu,t+1}\}$; then for each $s \in S$, set $a^*_{\xi,t}(s) \in \arg\max_{a \in A} Q_{\xi,t}(s,a)$ and, for each $\mu \in M$, $V_{\mu,t}(s) = Q_{\mu,t}(s, a^*_{\xi,t}(s))$.
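A minimal runnable sketch of MMBI under the same assumptions as the earlier snippets (finite horizon, a finite set of sampled transition matrices sharing a reward table; variable names are illustrative):

```python
import numpy as np

def mmbi(P_list, R, weights, gamma, T):
    """Multi-MDP backwards induction over a finite set of MDPs sharing S and A.

    P_list:  list of (S, A, S) transition arrays, one per MDP mu in the sample.
    R:       (S, A) rewards; weights: belief xi over the sampled MDPs.
    Returns a (T+1, S) array of actions: one memoryless policy per time step.
    """
    n_mdps, (n_states, n_actions, _) = len(P_list), P_list[0].shape
    V = np.zeros((n_mdps, n_states))                  # V_{mu, T+1} = 0
    policy = np.zeros((T + 1, n_states), dtype=int)
    for t in range(T, -1, -1):
        # Q_{mu,t}(s,a) = R(s,a) + gamma * sum_s' P_mu(s'|s,a) V_{mu,t+1}(s')
        Q_mu = np.stack([R + gamma * np.einsum("sat,t->sa", P_list[m], V[m])
                         for m in range(n_mdps)])
        Q_xi = np.einsum("m,msa->sa", np.asarray(weights), Q_mu)   # eq. (2.6), finite mixture
        policy[t] = Q_xi.argmax(axis=1)                            # a*_{xi,t}(s)
        V = Q_mu[:, np.arange(n_states), policy[t]]                # V_{mu,t}(s) = Q_{mu,t}(s, a*)
    return policy
```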

The expected utility: MCBRL, application to Bayesian RL. [Figure: regret of MCBRL and of the "Exploit" baseline as the number of sampled MDPs n varies from 2 to 16.] The algorithm (see the sketch below): 1. For $i = 1, \ldots$ 2. At time $t_i$, sample n MDPs from $\xi_{t_i}$. 3. Calculate the best memoryless policy $\pi_i$ with respect to the sample. 4. Execute $\pi_i$ until $t = t_{i+1}$. Relation to other work: for n = 1, this is equivalent to the Thompson sampling used by Strens [?]. Unlike BOSS [?], it does not become more optimistic as n increases. BEETLE [??] is a belief-sampling approach. Furmston and Barber [?] use approximate inference to estimate policies.
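A sketch of one MCBRL iteration built on the earlier pieces; the resampling schedule and environment interaction are left abstract, and the equal weighting of posterior samples is an assumption, not from the slides.

```python
import numpy as np

def mcbrl_step(belief, R, n_mdps, gamma, T, rng):
    """One MCBRL iteration: sample n MDPs from the current belief, plan with MMBI."""
    P_list = [belief.sample_mdp_transitions(rng) for _ in range(n_mdps)]
    weights = np.full(n_mdps, 1.0 / n_mdps)        # each posterior sample gets equal weight
    return mmbi(P_list, R, weights, gamma, T)      # memoryless policy, one per time step

# For n_mdps = 1 this reduces to Thompson sampling: plan optimally in a single posterior
# sample. Between resampling times t_i, the agent executes the returned policy and feeds
# every observed transition back into belief.update(s, a, s_next).
```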

The expected utility: generalisations. Policy search for improving the lower bounds; searching an enlarged class of policies; examining all history-based policies.

The expected utility: the belief-augmented MDP. [Figure: (a) the complete MDP model, with states $s_t$, beliefs $\xi_t$ and actions $a_t$; (b) the compact form of the model, with augmented states $\omega_t$.] The optimal policy for the augmented MDP is the ξ-optimal policy for the original problem.

$P(s_{t+1} \in S' \mid \xi_t, s_t, a_t) \triangleq \int P_\mu(s_{t+1} \in S' \mid s_t, a_t)\, d\xi_t(\mu)$  (2.7)
$\xi_{t+1}(\cdot) = \xi_t(\cdot \mid s_{t+1}, s_t, a_t)$  (2.8)

The expected utility: belief-augmented MDP tree structure. Consider an MDP family M with $A = \{a^1, a^2\}$ and $S = \{s^1, s^2\}$. Each augmented state $\omega_t = (s_t, \xi_t)$ branches on the action taken and the next state observed: the pairs $(a^1, s^1)$, $(a^1, s^2)$, $(a^2, s^1)$, $(a^2, s^2)$ lead to the four successor hyper-states $\omega^1_{t+1}, \ldots, \omega^4_{t+1}$.

Stochastic branch and bound: value bounds. Let $q^+$ and $q^-$ be upper and lower bounds such that

$q^+(\omega,a) \geq Q^*(\omega,a) \geq q^-(\omega,a)$  (2.9)
$v^+(\omega) = \max_{a \in A} q^+(\omega,a), \quad v^-(\omega) = \max_{a \in A} q^-(\omega,a)$  (2.10)
$q^+(\omega,a) = \sum_{\omega'} p(\omega' \mid \omega,a)\, [\, r(\omega,a,\omega') + v^+(\omega')\, ]$  (2.11)
$q^-(\omega,a) = \sum_{\omega'} p(\omega' \mid \omega,a)\, [\, r(\omega,a,\omega') + v^-(\omega')\, ]$  (2.12)

Remark: if $q^-(\omega,a) \geq q^+(\omega,b)$, then b is sub-optimal at ω.
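The remark gives the pruning rule used in the tree search. A minimal sketch (hypothetical node representation, not the authors' implementation):

```python
def prune_actions(q_lower, q_upper):
    """Keep only actions that might still be optimal at a node omega.

    q_lower, q_upper: dicts mapping action -> lower / upper bound on Q*(omega, action).
    An action b is discarded when some other action a satisfies q_lower[a] >= q_upper[b],
    since b can then never beat a.
    """
    keep = []
    for b in q_upper:
        dominated = any(q_lower[a] >= q_upper[b] for a in q_lower if a != b)
        if not dominated:
            keep.append(b)
    return keep

# Example: action 1 is provably dominated by action 0 and gets pruned.
print(prune_actions({0: 0.6, 1: 0.1}, {0: 0.9, 1: 0.5}))   # [0]
```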

Stochastic branch and bound for belief tree search [??]. Compute (stochastic) upper and lower bounds on the values of nodes via Monte-Carlo sampling; use the upper bounds to decide which nodes to expand, and the lower bounds to select the final policy. Sub-optimal branches are quickly discarded.

Partially observable Markov decision processes (POMDP). When acting in µ, at each time step t: the system state $s_t \in S$ is not observed; we receive an observation $x_t \in X$ and a reward $r_t \in R$; we take an action $a_t \in A$; the system transits to state $s_{t+1}$. In the Bayesian setting a belief ξ is additionally maintained over the unknown POMDP µ.

Definition (Partially observable Markov decision process). A POMDP $\mu \in M_P$ is a tuple $(X, S, A, P)$ where X is an observation space, S is a state space, A is an action space, and P is a conditional distribution on observations, states and rewards. The following Markov property holds:

$P_\mu(s_{t+1}, r_t, x_t \mid s_t, a_t, \ldots) = P(s_{t+1} \mid s_t, a_t)\, P(x_t \mid s_t)\, P(r_t \mid s_t)$  (3.1)

Partially observable Markov decision processes: belief state in POMDPs when µ is known. If µ defines the starting-state probabilities, then the belief is not subjective. For any distribution ξ on S we define

$\xi(s_{t+1} \mid a_t, \mu) \triangleq \int_S P_\mu(s_{t+1} \mid s_t, a_t)\, d\xi(s_t)$  (3.2)

Belief update:
$\xi_t(s_{t+1} \mid x_{t+1}, r_{t+1}, a_t, \mu) = \dfrac{P_\mu(x_{t+1}, r_{t+1} \mid s_{t+1})\, \xi_t(s_{t+1} \mid a_t, \mu)}{\xi_t(x_{t+1} \mid a_t, \mu)}$  (3.3)
$\xi_t(s_{t+1} \mid a_t, \mu) = \int_S P_\mu(s_{t+1} \mid s_t, a_t)\, d\xi_t(s_t)$  (3.4)
$\xi_t(x_{t+1} \mid a_t, \mu) = \int_S P_\mu(x_{t+1} \mid s_{t+1})\, d\xi_t(s_{t+1} \mid a_t, \mu)$  (3.5)

Partially observable Markov decision processes: example. If S, A, X are finite, then we can define $p_t(j) = P(x_t \mid s_t = j)$, $A_t(i,j) = P(s_{t+1} = j \mid s_t = i, a_t)$ and $b_t(i) = \xi_t(s_t = i)$. We can then use Bayes' theorem:

$b_{t+1} = \dfrac{\operatorname{diag}(p_{t+1})\, A_t^\top b_t}{p_{t+1}^\top A_t^\top b_t}$  (3.6)
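A minimal sketch of this discrete belief filter, assuming the matrix conventions above; names and the toy numbers are illustrative.

```python
import numpy as np

def belief_update(b, A, p_obs):
    """One step of the discrete POMDP belief filter, eq. (3.6).

    b:     (S,) current belief over hidden states, xi_t(s_t = i).
    A:     (S, S) transition matrix for the taken action, A[i, j] = P(s' = j | s = i, a).
    p_obs: (S,) observation likelihoods for the received x, p[j] = P(x | s' = j).
    """
    predicted = A.T @ b                       # predict step: xi_t(s_{t+1} | a_t)
    unnormalised = p_obs * predicted          # multiply by the observation likelihood
    return unnormalised / unnormalised.sum()  # normalise by P(x_{t+1} | a_t, belief)

# Tiny example: two hidden states, a sticky chain, and an observation that favours state 1.
A = np.array([[0.9, 0.1], [0.2, 0.8]])
print(belief_update(np.array([0.5, 0.5]), A, np.array([0.3, 0.7])))
```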

Partially observable Markov decision processes: when the POMDP µ is unknown.

$\xi(\mu, s_t \mid x_t, a_t) \propto P_\mu(x_t \mid s_t, a_t)\, P_\mu(s_t \mid a_t)\, \xi(\mu)$  (3.7)

Cases: finite M; finite S; the general case.

Partially observable Markov decision processes: strategies for POMDPs. Bayesian RL on POMDPs? Exact inference and planning; approximations and stochastic methods; policy search methods.