Markov decision processes

CS 2740 Knowledge Representation, Lecture 24: Markov decision processes
Milos Hauskrecht, milos@cs.pitt.edu, 5329 Sennott Square

Administrative announcements:
Final exam: Monday, December 8, 2008, in class; only the material covered after the midterm.
Term project assignment: due Thursday, December 11, 2008, at noon.

Decision trees. A decision tree is a basic approach to representing decision-making problems in the presence of uncertainty. Limitations: What if there are many decision stages to consider? What if there are many different outcomes?

Markov decision process (MDP). A framework for representing complex multi-stage decision problems in the presence of uncertainty. It gives a more compact representation of the decision problem and admits efficient solutions. An MDP models the dynamics of the environment under different actions and is frequently used in AI, operations research, and control theory. Markov assumption: the next state depends only on the previous state and action, not on states or actions further in the past.
[Figure: two-slice diagram with state(t-1) and action(t-1) leading to state(t) and producing reward(t-1).]

Markov decision process: formal definition. An MDP is a 4-tuple (S, A, T, R):
A set of states S (e.g., the locations of a robot)
A set of actions A (e.g., move actions)
A transition model $T: S \times A \times S \to [0,1]$ (where can I get with different moves?)
A reward model $R: S \times A \times S \to \mathbb{R}$ (reward/cost for a transition)

Markov decision problem. Example: agent navigation in a maze with 4 moves in compass directions. The effects of moves are stochastic: with non-zero probability we may end up in a location other than the intended one. Objective: reach the goal state in the shortest expected time.
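
As a concrete illustration, here is a minimal Python sketch of how the $(S, A, T, R)$ tuple might be encoded for a tiny chain-shaped navigation problem. The particular states, actions, probabilities, and rewards are invented for illustration; they are not from the slides.

```python
# A minimal sketch of how the (S, A, T, R) tuple might be encoded; the states,
# actions, probabilities, and rewards below are invented for illustration.
S = ["s0", "s1", "goal"]          # set of states (e.g. locations)
A = ["left", "right"]             # set of actions (moves)

# Transition model T: S x A x S -> [0, 1]; moves are stochastic, so an action
# may land the agent somewhere other than the intended location.
T = {
    ("s0", "left"):    {"s0": 1.0},
    ("s0", "right"):   {"s1": 0.8, "s0": 0.2},
    ("s1", "left"):    {"s0": 0.8, "s1": 0.2},
    ("s1", "right"):   {"goal": 0.8, "s1": 0.2},
    ("goal", "left"):  {"goal": 1.0},
    ("goal", "right"): {"goal": 1.0},
}

def R(s, a, s_next):
    """Reward model R: S x A x S -> reals (here it depends only on s')."""
    return 100.0 if s_next == "goal" and s != "goal" else -1.0

# Sanity check: outgoing probabilities sum to 1 for every (state, action) pair.
for (s, a), dist in T.items():
    assert abs(sum(dist.values()) - 1.0) < 1e-9
```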

Agent navigation example. An MDP model for the maze:
State space S: the positions of the agent
Action set A: the moves
Transition model P
Reward R: -1 for each move, +100 for reaching the goal
Goal: find the policy maximizing future expected rewards.

Policy. A policy maps states to actions, $\pi: S \to A$, for example $\pi(\text{Position 1}) = \text{right}$, $\pi(\text{Position 2}) = \text{right}$, ..., $\pi(\text{Position 20}) = \text{left}$.
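
A deterministic policy like the one above is just a lookup table from states to actions. The small Python sketch below mirrors the slide's "Position 1 -> right, ..., Position 20 -> left" illustration; the position names and action choices are hypothetical.

```python
# A hypothetical policy for the corridor example; the position names and the
# chosen actions only mirror the slide's illustration.
pi = {f"Position {i}": "right" for i in range(1, 20)}  # Positions 1..19 move right
pi["Position 20"] = "left"                             # Position 20 moves left

def act(state):
    """Deterministic policy lookup: pi maps each state to a single action."""
    return pi[state]

print(act("Position 1"), act("Position 20"))   # -> right left
```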

MDP problem. We want to find the best policy $\pi: S \to A$. The value function $V$ for a policy quantifies the goodness of the policy through, e.g., the infinite-horizon discounted criterion $E\left(\sum_{t=0}^{\infty} \gamma^t r_t\right)$. It (1) combines future rewards over a trajectory and (2) combines rewards across multiple trajectories through expectation-based measures.

Valuation models. Objective: find a policy $\pi: S \to A$ that maximizes some combination of future reinforcements (rewards) received over time. Valuation models (they quantify how good the mapping is):
Finite horizon model: $E\left(\sum_{t=0}^{T} r_t\right)$, with time horizon $T > 0$
Infinite horizon discounted model: $E\left(\sum_{t=0}^{\infty} \gamma^t r_t\right)$, with discount factor $0 < \gamma < 1$
Average reward: $\lim_{T \to \infty} \frac{1}{T} E\left(\sum_{t=0}^{T} r_t\right)$
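
The snippet below evaluates the inner sums of the three criteria on a single sampled reward trajectory; the rewards and the discount factor are invented for illustration. The slide's criteria take an expectation over trajectories, so a Monte Carlo estimate would average these quantities over many rollouts.

```python
# Illustrative computation of the three valuation criteria on one sampled
# reward trajectory r_0, ..., r_T; the rewards and gamma below are invented.
rewards = [-1, -1, -1, -1, 100]   # hypothetical trajectory that reaches the goal
gamma = 0.9                        # discount factor, 0 < gamma < 1

finite_horizon = sum(rewards)                                     # sum_{t=0}^{T} r_t
discounted = sum(gamma**t * r for t, r in enumerate(rewards))     # sum_t gamma^t r_t
average = sum(rewards) / len(rewards)                             # (1/T) sum_t r_t

print(finite_horizon, discounted, average)
```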

Value of a policy for an MDP. Assume a fixed policy $\pi: S \to A$. How do we compute the value of the policy under the infinite-horizon discounted model? Fixed-point equation:
$V^{\pi}(s) = R(s, \pi(s)) + \gamma \sum_{s' \in S} P(s' \mid s, \pi(s))\, V^{\pi}(s')$
where the first term is the expected one-step reward for the first action and the second term is the expected discounted reward for following the policy for the rest of the steps. For a finite state space we get a set of linear equations: $\mathbf{v} = \mathbf{r} + U\mathbf{v}$, hence $\mathbf{v} = (I - U)^{-1}\mathbf{r}$.

Optimal policy. The value of the optimal policy satisfies
$V^{*}(s) = \max_{a \in A}\left[ R(s, a) + \gamma \sum_{s' \in S} P(s' \mid s, a)\, V^{*}(s') \right]$
where, again, the first term is the expected one-step reward for the first action and the second term is the expected discounted reward for following the optimal policy for the rest of the steps. In value-function mapping form: $V^{*}(s) = (H V^{*})(s)$. The optimal policy $\pi^{*}: S \to A$ is
$\pi^{*}(s) = \arg\max_{a \in A}\left[ R(s, a) + \gamma \sum_{s' \in S} P(s' \mid s, a)\, V^{*}(s') \right]$
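
For a finite state space, the fixed-point equation can be solved directly as a linear system, as in $\mathbf{v} = (I - U)^{-1}\mathbf{r}$ with $U = \gamma P_{\pi}$. Below is a NumPy sketch on a small 3-state chain under a fixed policy; the transition matrix and rewards are invented for illustration.

```python
import numpy as np

# Policy evaluation as a linear system, following v = r + Uv  =>  v = (I - U)^{-1} r,
# with U = gamma * P_pi. The 3-state chain and its numbers are invented for illustration.
gamma = 0.9

# P_pi[i, j] = P(s_j | s_i, pi(s_i)): transitions under the fixed policy (state 2 = goal).
P_pi = np.array([
    [0.2, 0.8, 0.0],
    [0.0, 0.2, 0.8],
    [0.0, 0.0, 1.0],   # the goal state is absorbing
])

# r[i] = expected one-step reward R(s_i, pi(s_i)); e.g. from state 1:
# 0.8 * 100 (reach the goal) + 0.2 * (-1) (stay) = 79.8.
r = np.array([-1.0, 79.8, 0.0])

V_pi = np.linalg.solve(np.eye(3) - gamma * P_pi, r)
print(V_pi)   # value of each state under the fixed policy
```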

Computing the optimal policy. Three methods: value iteration, policy iteration, and linear programming.

Value iteration. Method: computes the optimal value function first, then extracts the policy; the iterative approximation converges to the optimal value function.
Value iteration($\varepsilon$):
  initialize $V$   ;; $V$ is a vector of values, one per state
  repeat
    set $V' \leftarrow V$
    set $V \leftarrow HV$
  until $\|V' - V\| \le \varepsilon$
  output $\pi(s) = \arg\max_{a \in A}\left[ R(s, a) + \gamma \sum_{s' \in S} P(s' \mid s, a)\, V(s') \right]$
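
A NumPy sketch of the value-iteration loop $V \leftarrow HV$ is shown below; the 3-state, 2-action MDP (arrays P and R) is invented for illustration and is not the slides' maze.

```python
import numpy as np

# Sketch of value iteration: apply the Bellman operator H until the change
# falls below epsilon, then extract a greedy policy.
gamma, eps = 0.9, 1e-6

# P[a, s, s'] = P(s' | s, a); R[a, s] = expected immediate reward for taking a in s.
P = np.array([
    [[1.0, 0.0, 0.0], [0.8, 0.2, 0.0], [0.0, 0.0, 1.0]],   # action 0 ("left")
    [[0.2, 0.8, 0.0], [0.0, 0.2, 0.8], [0.0, 0.0, 1.0]],   # action 1 ("right")
])
R = np.array([
    [-1.0, -1.0, 0.0],     # action 0
    [-1.0, 79.8, 0.0],     # action 1: 0.8 * 100 + 0.2 * (-1) from state 1
])

V = np.zeros(P.shape[1])
while True:
    Q = R + gamma * np.einsum("ast,t->as", P, V)   # Q[a, s] = R(s, a) + gamma * sum_s' P * V
    V_new = Q.max(axis=0)                          # (HV)(s) = max_a Q[a, s]
    done = np.max(np.abs(V_new - V)) <= eps
    V = V_new
    if done:
        break

Q = R + gamma * np.einsum("ast,t->as", P, V)       # recompute with the final values
pi = Q.argmax(axis=0)                              # greedy policy from the (near-)optimal V
print(V, pi)
```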

Policy iteration Method: Takes a policy and computes its value Iteratively improves the policy, till policy cannot be further improved Policy iteration initialize a policy µ repeat set π µ compute π V compute µ ( s) = arg max R( s, + γ P ( s' s, V a A s ' S until π µ output π π ( s' ) Linear program solution Method: converts the problem as a linear program problem Let v s denotes V(s) Linear program: Subject to: v max v s s S R( s, + γ P ( s' s, s v s ' s ' S s S, a A