Markov Decision Processes

Noel Welsh, 11 November 2010

Announcements
Applicant visitor day seeks robot demonstrators for an exciting half hour of robotic demonstrations.

Recap
We have studied the simplest decision-making problem, the bandit problem. The bandit problem is very restrictive. Today we're going to look at partially relaxing the restrictions.

Extending the Bandit Problem
Two extensions to the bandit problem:
- State
- Observations (in a restricted sense)
The model we'll arrive at is called a Markov decision process (MDP).

State
What does state mean? Informally, actions have consequences. In (the MDP extension to) the bandit problem, this means rewards for an action are not always given by the same distribution. Rewards are a function of state and action.

Control
The process does not transition randomly between states. If it did, this would be (like) the non-stochastic bandit I mentioned briefly last lecture. Rather, the robot's actions influence the choice of next state, so the robot has some control over its environment. In the general case, actions do not decide the next state, but influence its probability.

The MDP Formalism
Two additions (so far) to the bandit model:
- Rewards are a function of action and state.
- The next state is a function of action and current state.
From this we get the Markov decision process (MDP):
- $S$ is the set of states that the environment may be in.
- $A$ is the set of actions that the agent can perform in the environment.
- $T$ is the transition function, which determines the next state given the current state and the chosen action. $T : S \times A \times S \to [0, 1]$ defines the probability $T(s, a, s') = P(s' \mid a, s)$.
- $\rho$ is the reward function that maps state and action pairs to rewards: $\rho : S \times A \to \mathbb{R}$.
- $I$ is the initial distribution over states, $I : S \to [0, 1]$, defining the probability $I(s) = P(s)$.
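
As a concrete illustration only, a small enumerated MDP could be written as plain Python dictionaries. This is a minimal sketch; the states, actions, and numbers below are invented for the example and are not from the lecture.

# A minimal sketch of an enumerated MDP as plain Python data structures.
# The states, actions, rewards, and probabilities are invented for illustration.

states = ["home", "field"]
actions = ["stay", "move"]

# T[(s, a)] maps next states s' to probabilities P(s' | a, s).
T = {
    ("home", "stay"): {"home": 1.0},
    ("home", "move"): {"home": 0.2, "field": 0.8},
    ("field", "stay"): {"field": 1.0},
    ("field", "move"): {"field": 0.3, "home": 0.7},
}

# rho[(s, a)] is the (expected) immediate reward for taking action a in state s.
rho = {
    ("home", "stay"): 0.0,
    ("home", "move"): -1.0,
    ("field", "stay"): 2.0,
    ("field", "move"): -1.0,
}

# I[s] is the initial state distribution P(s).
I = {"home": 1.0, "field": 0.0}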

The Reward Model
In the bandit model there was a distribution over rewards. Here we've said that rewards are a deterministic function of state and action ($\rho : S \times A \to \mathbb{R}$). What's up? We can view this as a simplification: rewards are a deterministic function of state and action in expectation (or average, or mean). So long as we have a model (we'll talk about this later) this doesn't make any difference.

The Markov Property
The distribution over the next state is determined only by the current state and action. The past is irrelevant given the current state. This is the Markov property: the future is independent of the past given the present. It makes computation easier.
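
Written out, the Markov property says that conditioning on the full history adds nothing beyond the current state and action:

$P(s_{t+1} \mid s_t, a_t, s_{t-1}, a_{t-1}, \dots, s_0, a_0) = P(s_{t+1} \mid s_t, a_t) = T(s_t, a_t, s_{t+1})$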

What is a State?
What exactly is a state? We can reverse the Markov property to define it: a state is sufficient information about the world such that the Markov property holds. Hence, a state is a sufficient statistic. It summarises everything we need to know. This has applications in inferring models from data.

What is State, Take 2
If the set of states $S$ has no structure we call it enumerated. If states are represented as tuples of values it is called a factored representation. If the states are represented by a graph of relations the model is a relational MDP. The basic theory deals with enumerated representations. Specialised algorithms apply to factored and relational representations, but you can still use the fundamental enumerated algorithms with suitable modifications.

Observations
In an MDP, observations uniquely identify the state. This is a fully observable model, so we often just remove observations from the model and consider states to be observations. If this isn't true, the world is partially observable. The corresponding model is the partially observable Markov decision process (POMDP). Do robots operate in fully or partially observable worlds?

MDPs in Robotics
Why use MDPs in robotics? Two good reasons:
- Engineering trade-off: MDPs are much easier to work with.
- For modelling purposes an MDP may be sufficient. Sensors (and processing) are actually getting very good. E.g. FAB-MAP [Cummins and Newman(2008)], given thousands of images, can match those taken at the same spot with very little error. Maybe an assumption of full observability isn't so bad?

Other Limitations of MDPs
We only consider discrete state and action spaces. Continuous spaces are more appropriate for low-level control (but you said you don't want to learn about this). We are not considering more than one agent; those models are known as Markov games. Time is somewhat problematic.

Solving an MDP
What do we want to do with an MDP? Find a (near-)optimal way to act. What does that mean? We need a measure of quality, that is, an objective function. In the following, we assume we have a model available; this setting is known as planning. If we get time we'll look at what we do when we don't have a model available, which is called reinforcement learning.

Value
In the bandit problem we used regret as our objective function. In MDPs the traditional objective function is value. More recent work tends to use regret, but we're kicking it old skool today.

Policies
A policy is a way of acting. We always consider the value of some policy. By the Markov property the current state and action completely determine the probability of the next state and reward, so the current state encapsulates everything necessary to choose the best action. Thus we can assume a stationary or reactive policy. Formally, a stationary policy $\pi$ is a function from states to actions: $\pi : S \to A$.
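
Continuing the invented toy MDP sketched earlier, a stationary policy is just a lookup table from states to actions; a minimal sketch, assuming the same made-up state and action names:

# A stationary (reactive) policy pi : S -> A as a simple lookup table,
# over the invented states and actions of the earlier toy MDP.
pi = {
    "home": "move",
    "field": "stay",
}

action = pi["home"]  # the policy always answers "move" whenever the state is "home"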

Value, Again
Consider only a single time step. The environment is in state $s$. The one-step value for a policy $\pi$ is the reward it receives for taking a single action. We write it as $V^\pi_1(s)$. What is it?

Optimal One-step Value
We can easily work out the best one-step value $V^*_1(s)$. What is it? From this we can derive a one-step optimal policy:

$\pi^*_1(s) = \operatorname{argmax}_{a \in A} \rho(s, a) \quad (1)$
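
Spelled out from the definitions above, the one-step value of a policy is just its immediate reward, and the best one-step value is the best immediate reward, consistent with equation (1):

$V^\pi_1(s) = \rho(s, \pi(s)), \qquad V^*_1(s) = \max_{a \in A} \rho(s, a)$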

Exploration and Exploitation
One-step value is not the true value. Why not?

The True Value
Given a policy $\pi$ we can write the value of the policy as:

$V^\pi(s) = \rho(s, \pi(s)) + \gamma \sum_{s' \in S} T(s, \pi(s), s')\,\rho(s', \pi(s')) + \gamma^2 \sum_{s' \in S} \sum_{s'' \in S} T(s, \pi(s), s')\,T(s', \pi(s'), s'')\,\rho(s'', \pi(s'')) + \dots \quad (2)$

Confused?

Understanding Value
We include the transition function $T$ to weight the future rewards by the probability of achieving them. The discount factor $\gamma$ serves two purposes: it keeps the sum of rewards bounded, and it determines how we weight future rewards relative to immediate rewards. If the discount factor is zero we completely ignore future rewards. If the discount factor is one, future rewards, no matter how far away they are, are considered equal to immediate rewards. Values between zero and one give progressively less weight to rewards the further away they are. Values greater than one give more weight to rewards the further away they are, a nonsensical choice in any realistic setting.
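
To see why a discount factor strictly below one keeps the sum bounded: if every reward has magnitude at most $R_{\max}$, the discounted sum is dominated by a geometric series,

$|V^\pi(s)| \le \sum_{t=0}^{\infty} \gamma^t R_{\max} = \frac{R_{\max}}{1 - \gamma} \quad \text{for } 0 \le \gamma < 1.$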

Value, Simplified
Note that the definition of value has a recursive structure. This is more apparent in the standard definition of value, which is:

$V^\pi(s) = \rho(s, \pi(s)) + \gamma \sum_{s' \in S} T(s, \pi(s), s')\, V^\pi(s') \quad (3)$
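
Because equation (3) is linear in $V^\pi$, it can be solved directly or by repeated substitution. Here is a minimal Python sketch of the latter (iterative policy evaluation), assuming the invented toy MDP dictionaries and policy from the earlier sketches:

def evaluate_policy(states, T, rho, pi, gamma=0.9, epsilon=1e-6):
    """Repeatedly apply equation (3) until the value estimates stop changing."""
    V = {s: 0.0 for s in states}
    while True:
        delta = 0.0
        for s in states:
            a = pi[s]
            # V(s) <- rho(s, pi(s)) + gamma * sum_{s'} T(s, pi(s), s') * V(s')
            new_v = rho[(s, a)] + gamma * sum(
                p * V[s_next] for s_next, p in T[(s, a)].items()
            )
            delta = max(delta, abs(new_v - V[s]))
            V[s] = new_v
        if delta < epsilon:
            return V

For example, evaluate_policy(states, T, rho, pi) on the toy MDP above returns an estimate of $V^\pi$ for the two invented states.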

Optimal Value
The previous definition gives the value of a given policy. We can instead maximise value, yielding:

$V^*(s) = \max_{a \in A} \left( \rho(s, a) + \gamma \sum_{s' \in S} T(s, a, s')\, V^*(s') \right) \quad (4)$

Optimal Policy
Given $V^*$ we can define the optimal policy as that which chooses the action giving the highest value:

$\pi^*(s) = \operatorname{argmax}_{a \in A} \left( \rho(s, a) + \gamma \sum_{s' \in S} T(s, a, s')\, V^*(s') \right) \quad (5)$

Bellman Equations
The optimal value (or policy) equation is known as a Bellman equation. It is a set of simultaneous equations. We solved simultaneous equations in high school, so what's the big deal? The maximisation operator means the equations are not linear, so Gaussian elimination cannot be used to solve them. Luckily, there are simple algorithms for solving MDPs.

Value Iteration Overview
Value iteration is a classic algorithm for solving MDPs. The inner loop of value iteration takes a $(t-1)$-step estimate of the value function and extends it a single time step into the future. The outer loop just repeats the inner loop until the maximum change in value is less than some threshold.

Value Iteration Inner Loop
Function onevalueiteration($\langle S, A, T, \rho, I \rangle$, $V_{t-1}$): inner loop of value iteration for an MDP
    for $s \in S$ do
        for $a \in A$ do
            $Q_t(s, a) \leftarrow \rho(s, a) + \gamma \sum_{s' \in S} T(s, a, s')\, V_{t-1}(s')$
        end
        $\pi_t(s) \leftarrow \operatorname{argmax}_a Q_t(s, a)$
        $V_t(s) \leftarrow Q_t(s, \pi_t(s))$
    end
    return $(V_t, \pi_t)$

Value Iteration Outer Loop
Function valueiteration($\langle S, A, T, \rho, I \rangle$, $\epsilon$): value iteration for an MDP
    $t \leftarrow 0$
    $V_0 \leftarrow 0$
    repeat
        $t \leftarrow t + 1$
        $(V_t, \pi_t) \leftarrow$ onevalueiteration($V_{t-1}$)
    until $\max_{s} |V_t(s) - V_{t-1}(s)| < \epsilon$
    return $(V_t, \pi_t)$
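
As a cross-check on the pseudocode, here is a minimal runnable Python sketch of the same algorithm, assuming the invented toy MDP dictionaries (states, actions, T, rho) defined in the earlier example; none of these names come from the lecture itself.

def one_value_iteration(states, actions, T, rho, V_prev, gamma=0.9):
    """Inner loop: one Bellman backup of the value estimate, plus the greedy policy."""
    V, pi = {}, {}
    for s in states:
        # Q(s, a) = rho(s, a) + gamma * sum_{s'} T(s, a, s') * V_prev(s')
        Q = {
            a: rho[(s, a)] + gamma * sum(p * V_prev[s2] for s2, p in T[(s, a)].items())
            for a in actions
        }
        pi[s] = max(Q, key=Q.get)   # greedy action
        V[s] = Q[pi[s]]
    return V, pi


def value_iteration(states, actions, T, rho, gamma=0.9, epsilon=1e-6):
    """Outer loop: repeat Bellman backups until the largest change falls below epsilon."""
    V = {s: 0.0 for s in states}
    while True:
        V_new, pi = one_value_iteration(states, actions, T, rho, V, gamma)
        if max(abs(V_new[s] - V[s]) for s in states) < epsilon:
            return V_new, pi
        V = V_new

Calling value_iteration(states, actions, T, rho) on the toy MDP returns an approximation of $V^*$ together with the corresponding greedy policy, mirroring equations (4) and (5).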

Bibliography
Mark Cummins and Paul Newman. FAB-MAP: Probabilistic Localization and Mapping in the Space of Appearance. The International Journal of Robotics Research, 27(6):647-665, 2008.