Markov Decision Processes Chapter 17 Mausam

Planning Agent: the agent repeatedly receives percepts from the environment, asks "what action next?", and executes actions. Planning problems vary along several dimensions: Static vs. Dynamic environment, Fully vs. Partially Observable, Deterministic vs. Stochastic actions, Perfect vs. Noisy percepts, Instantaneous vs. Durative actions.

Classical Planning: a static environment, fully observable, perfect percepts, deterministic and instantaneous actions; the agent still asks "what action next?"

Stochastic Planning (MDPs): a static environment, fully observable, perfect percepts, instantaneous actions, but action outcomes are stochastic.

MDP vs. Decision Theory: decision theory deals with episodic (one-shot) decisions; an MDP is sequential.

Markov Decision Process (MDP):
S: a set of states (possibly factored, giving a Factored MDP)
A: a set of actions
T(s, a, s'): transition model
C(s, a, s'): cost model
G: set of goals (absorbing or non-absorbing)
s_0: start state
γ: discount factor
R(s, a, s'): reward model
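
For concreteness, here is a minimal sketch (not from the slides) of how these components might be held in code; the class and field names are illustrative assumptions.

```python
from dataclasses import dataclass
from typing import Dict, List, Set, Tuple

# A minimal container for the MDP components listed above.
# Field names (states, actions, transition, cost, goals, start, gamma)
# are illustrative assumptions, not notation from the slides.
@dataclass
class MDP:
    states: Set[str]
    actions: Dict[str, List[str]]                         # applicable actions per state
    transition: Dict[Tuple[str, str], Dict[str, float]]   # T(s, a, s') as {(s, a): {s': prob}}
    cost: Dict[Tuple[str, str, str], float]               # C(s, a, s')
    goals: Set[str]                                        # absorbing goal states
    start: str                                             # s_0
    gamma: float = 1.0                                     # discount factor (1.0 for undiscounted SSPs)

    def successors(self, s: str, a: str) -> Dict[str, float]:
        """Return {s': T(s, a, s')} for a state-action pair."""
        return self.transition[(s, a)]
```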

Objective of an MDP: find a policy π: S → A which optimizes the objective, i.e. minimizes expected cost to reach a goal, maximizes expected reward, or maximizes expected (reward − cost); discounted or undiscounted; over a finite, infinite, or indefinite horizon; assuming full observability.

Role of the Discount Factor (γ): it keeps the total reward/total cost finite, which is useful for infinite-horizon problems. Intuition (economics): money today is worth more than money tomorrow. Total reward: $r_1 + \gamma r_2 + \gamma^2 r_3 + \cdots$. Total cost: $c_1 + \gamma c_2 + \gamma^2 c_3 + \cdots$
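
As a brief check (a standard geometric-series bound, not spelled out on the slide), the discounted total stays finite whenever the per-step rewards are bounded:
\[
\sum_{t=1}^{\infty} \gamma^{\,t-1} r_t \;\le\; \sum_{t=1}^{\infty} \gamma^{\,t-1} r_{\max}
\;=\; \frac{r_{\max}}{1-\gamma}
\qquad \text{assuming } |r_t| \le r_{\max} \text{ and } 0 \le \gamma < 1 .
\]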

Examples of MDPs:
Goal-directed, Indefinite Horizon, Cost Minimization MDP ⟨S, A, T, C, G, s_0⟩: most often studied in the planning and graph theory communities.
Infinite Horizon, Discounted Reward Maximization MDP ⟨S, A, T, R, γ⟩: most often studied in the machine learning, economics, and operations research communities; the most popular model.
Oversubscription Planning (non-absorbing goals, reward maximization) ⟨S, A, T, G, R, s_0⟩: a relatively recent model.

Acyclic vs. Cyclic MDPs (figure: two MDPs over a state P with actions a and b, intermediate states Q, R, S, T with action c, and goal G; costs C(a) = 5, C(b) = 10, C(c) = 1).
Acyclic case: expectimin works. V(Q) = V(R) = V(S) = V(T) = 1 and V(P) = 6, achieved by action a.
Cyclic case: expectimin doesn't work (infinite loop). V(R) = V(S) = V(T) = 1 and Q(P,b) = 11, but Q(P,a) = ? Suppose I decide to take a in P: then $Q(P,a) = 5 + 0.4 \cdot 1 + 0.6\, Q(P,a)$, which solves to Q(P,a) = 13.5.

Brute force Algorithm: go over all policies π. How many? $|A|^{|S|}$: finite. Evaluate each policy: $V^\pi(s)$ = expected cost of reaching the goal from s (how to evaluate?). Choose the best. We know that a best policy exists (SSP optimality principle): $V^{\pi^*}(s) \le V^\pi(s)$.

Policy Evaluation: given a policy π, compute $V^\pi$, the cost of reaching the goal while following π.

Deterministic MDPs. Policy graph for π with π(s_0) = a_0 and π(s_1) = a_1: s_0 reaches s_1 via a_0 (C=5), and s_1 reaches s_g via a_1 (C=1). Then $V^\pi(s_1) = 1$ and $V^\pi(s_0) = 6$: just add the costs on the path to the goal.

Acyclic MDPs. Policy graph for π: from s_0, action a_0 reaches s_1 with Pr=0.6, C=5 and s_2 with Pr=0.4, C=2; from s_1, action a_1 reaches s_g with C=1; from s_2, action a_2 reaches s_g with C=4. Then $V^\pi(s_1) = 1$, $V^\pi(s_2) = 4$, and $V^\pi(s_0) = 0.6(5+1) + 0.4(2+4) = 6$: a backward pass in reverse topological order suffices.

General MDPs can be cyclic! Same policy graph, except that from s_2, action a_2 reaches s_g with Pr=0.7, C=4 and loops back to s_0 with Pr=0.3, C=3. We cannot do a simple single pass: $V^\pi(s_1) = 1$, but $V^\pi(s_2)$ depends on $V^\pi(s_0)$, and $V^\pi(s_0)$ depends on $V^\pi(s_2)$.

General SSPs can be cyclic! Writing out the equations for the same cyclic example:
\[
V^\pi(s_g) = 0, \qquad V^\pi(s_1) = 1 + V^\pi(s_g) = 1,
\]
\[
V^\pi(s_2) = 0.7\,(4 + V^\pi(s_g)) + 0.3\,(3 + V^\pi(s_0)), \qquad
V^\pi(s_0) = 0.6\,(5 + V^\pi(s_1)) + 0.4\,(2 + V^\pi(s_2)).
\]
This is a simple system of linear equations.
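
For completeness (the algebra is implicit in the slides: the reduced forms reappear on the iterative-evaluation slide below), substituting $V^\pi(s_1) = 1$ and solving the remaining two equations gives:
\[
V^\pi(s_0) = 4.4 + 0.4\,V^\pi(s_2), \qquad V^\pi(s_2) = 3.7 + 0.3\,V^\pi(s_0)
\]
\[
\Rightarrow\; V^\pi(s_0) = 4.4 + 0.4\,(3.7 + 0.3\,V^\pi(s_0))
\;\Rightarrow\; 0.88\,V^\pi(s_0) = 5.88
\;\Rightarrow\; V^\pi(s_0) \approx 6.682, \quad V^\pi(s_2) \approx 5.705 .
\]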

Policy Evaluation (Approach 1): solve the system of linear equations
\[
V^\pi(s) = 0 \ \text{if } s \in G, \qquad
V^\pi(s) = \sum_{s' \in S} T(s, \pi(s), s')\,\bigl[C(s, \pi(s), s') + V^\pi(s')\bigr] \ \text{otherwise}.
\]
|S| variables, O(|S|³) running time.
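
A possible sketch of Approach 1 in code (not from the slides; it assumes the illustrative data layout used above and dense matrices), solving $(I - T_\pi)V = c_\pi$ with a standard linear solver:

```python
import numpy as np

def evaluate_policy_exact(states, goals, policy, transition, cost):
    """Solve the linear system V = c_pi + T_pi V exactly (Approach 1).

    states: list of state names; goals: set of goal states
    policy: {s: a}; transition: {(s, a): {s': prob}}; cost: {(s, a, s'): c}
    (These argument shapes are illustrative assumptions.)
    Assumes the policy is proper, so that (I - T_pi) is invertible.
    """
    idx = {s: i for i, s in enumerate(states)}
    n = len(states)
    T_pi = np.zeros((n, n))   # transition matrix under the policy
    c_pi = np.zeros(n)        # expected one-step cost under the policy
    for s in states:
        if s in goals:
            continue          # goal row stays zero, forcing V(goal) = 0
        a = policy[s]
        for s_next, p in transition[(s, a)].items():
            T_pi[idx[s], idx[s_next]] += p
            c_pi[idx[s]] += p * cost[(s, a, s_next)]
    # V = c_pi + T_pi V  =>  (I - T_pi) V = c_pi
    V = np.linalg.solve(np.eye(n) - T_pi, c_pi)
    return {s: V[idx[s]] for s in states}
```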

Iterative Policy Evaluation (figure: the cyclic example with $V^\pi(s_1) = 1$ and the updates $V^\pi(s_0) = 4.4 + 0.4\,V^\pi(s_2)$ and $V^\pi(s_2) = 3.7 + 0.3\,V^\pi(s_0)$). Starting from 0, the successive estimates of $V^\pi(s_0)$ are 0, 5.88, 6.5856, 6.670272, 6.68043, ..., and the successive estimates of $V^\pi(s_2)$ are 0, 3.7, 5.464, 5.67568, 5.7010816, 5.704129, ...

Policy Evaluation (Approach 2):
\[
V^\pi(s) = \sum_{s' \in S} T(s, \pi(s), s')\,\bigl[C(s, \pi(s), s') + V^\pi(s')\bigr]
\]
Iterative refinement:
\[
V^\pi_n(s) \leftarrow \sum_{s' \in S} T(s, \pi(s), s')\,\bigl[C(s, \pi(s), s') + V^\pi_{n-1}(s')\bigr]
\]

Iterative Policy Evaluation (algorithm): repeat the refinement above over iterations n = 1, 2, ..., stopping when the ε-consistency termination condition holds, i.e. when $\max_s |V^\pi_n(s) - V^\pi_{n-1}(s)| \le \epsilon$.
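
A sketch of Approach 2 (again assuming the illustrative data layout used above; the slide's algorithm box did not survive transcription, so the stopping rule shown is the standard ε-consistency check):

```python
def evaluate_policy_iterative(states, goals, policy, transition, cost,
                              epsilon=1e-4, max_iters=10_000):
    """Iterative policy evaluation: repeatedly apply
    V_n(s) = sum_{s'} T(s, pi(s), s') * [C(s, pi(s), s') + V_{n-1}(s')]
    until max_s |V_n(s) - V_{n-1}(s)| <= epsilon."""
    V = {s: 0.0 for s in states}            # arbitrary initialization
    for _ in range(max_iters):
        V_new = {}
        for s in states:
            if s in goals:
                V_new[s] = 0.0              # goals cost nothing
                continue
            a = policy[s]
            V_new[s] = sum(p * (cost[(s, a, s_next)] + V[s_next])
                           for s_next, p in transition[(s, a)].items())
        residual = max(abs(V_new[s] - V[s]) for s in states)
        V = V_new
        if residual <= epsilon:             # epsilon-consistency termination
            break
    return V
```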

Convergence & Optimality: for a proper policy π, iterative policy evaluation converges to the true value of the policy, i.e. $\lim_{n \to \infty} V^\pi_n = V^\pi$, irrespective of the initialization $V_0$.

Policy Evaluation → Value Iteration (Bellman Equations for MDP_1, ⟨S, A, T, C, G, s_0⟩). Define V*(s), the optimal cost, as the minimum expected cost to reach a goal from this state. V* should satisfy the following equation:
\[
V^*(s) = 0 \ \text{if } s \in G, \qquad
V^*(s) = \min_{a \in A} \underbrace{\sum_{s' \in S} T(s, a, s')\,\bigl[C(s, a, s') + V^*(s')\bigr]}_{Q^*(s,a)} \ \text{otherwise},
\]
so $V^*(s) = \min_a Q^*(s,a)$.

Bellman Equations for MDP_2, ⟨S, A, T, R, s_0, γ⟩. Define V*(s), the optimal value, as the maximum expected discounted reward from this state. V* should satisfy the following equation:
\[
V^*(s) = \max_{a \in A} \sum_{s' \in S} T(s, a, s')\,\bigl[R(s, a, s') + \gamma\, V^*(s')\bigr].
\]

Fixed Point Computation in VI:
\[
V^*(s) = \min_{a \in A} \sum_{s' \in S} T(s, a, s')\,\bigl[C(s, a, s') + V^*(s')\bigr]
\]
Iterative refinement:
\[
V_n(s) \leftarrow \min_{a \in A} \sum_{s' \in S} T(s, a, s')\,\bigl[C(s, a, s') + V_{n-1}(s')\bigr]
\]
Unlike policy evaluation, this update is non-linear.

Example (figure): an SSP over states s_0, ..., s_4 with goal s_g; each state offers a choice of actions (a_00, a_01 from s_0; a_20, a_21 from s_2; a_40, a_41 from s_4; a_1 from s_1; a_3 from s_3). From s_4, action a_40 reaches s_g with C=5, while a_41 has C=2 and reaches s_g with Pr=0.6 or s_3 with Pr=0.4.

Bellman Backup at s_4, with $V_0(s_g) = 0$ and $V_0(s_3) = 2$:
$Q_1(s_4, a_{40}) = 5 + 0 = 5$
$Q_1(s_4, a_{41}) = 2 + 0.6 \cdot 0 + 0.4 \cdot 2 = 2.8$
$V_1(s_4) = \min(5,\, 2.8) = 2.8$, with greedy action $a_{\text{greedy}} = a_{41}$.
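
A single backup can be written as a small function (a sketch under the same assumed data layout as above); on the numbers from this slide it returns approximately (2.8, 'a41'):

```python
def bellman_backup(s, actions, transition, cost, V):
    """One Bellman backup for a cost-minimization SSP:
    V_new(s) = min_a sum_{s'} T(s, a, s') * [C(s, a, s') + V(s')].
    Returns the backed-up value and the greedy action."""
    best_value, best_action = float("inf"), None
    for a in actions[s]:
        q = sum(p * (cost[(s, a, s_next)] + V[s_next])
                for s_next, p in transition[(s, a)].items())
        if q < best_value:
            best_value, best_action = q, a
    return best_value, best_action

# The s_4 example from the slide:
transition = {("s4", "a40"): {"sg": 1.0}, ("s4", "a41"): {"sg": 0.6, "s3": 0.4}}
cost = {("s4", "a40", "sg"): 5, ("s4", "a41", "sg"): 2, ("s4", "a41", "s3"): 2}
V0 = {"sg": 0.0, "s3": 2.0}
print(bellman_backup("s4", {"s4": ["a40", "a41"]}, transition, cost, V0))
# -> approximately (2.8, 'a41')
```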

Value Iteration [Bellman 57]: no restriction on the initial value function. Repeat the Bellman backup at every state for iterations n = 1, 2, ... until the ε-consistency termination condition holds, i.e. the largest change between $V_n$ and $V_{n-1}$ is at most ε.
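
A sketch of value iteration for the cost-minimization (SSP) formulation, under the same assumed data layout; the stopping rule is the ε-consistency check named on the slide:

```python
def value_iteration(states, goals, actions, transition, cost,
                    epsilon=1e-4, max_iters=10_000):
    """Value iteration for a cost-minimization SSP.
    Applies V_n(s) = min_a sum_{s'} T(s,a,s') [C(s,a,s') + V_{n-1}(s')]
    until max_s |V_n(s) - V_{n-1}(s)| <= epsilon, then extracts a greedy policy."""
    V = {s: 0.0 for s in states}              # any initialization works
    for _ in range(max_iters):
        V_new = {}
        for s in states:
            if s in goals:
                V_new[s] = 0.0
                continue
            V_new[s] = min(
                sum(p * (cost[(s, a, s_next)] + V[s_next])
                    for s_next, p in transition[(s, a)].items())
                for a in actions[s])
        residual = max(abs(V_new[s] - V[s]) for s in states)
        V = V_new
        if residual <= epsilon:               # epsilon-consistency termination
            break
    # greedy policy extraction
    policy = {}
    for s in states:
        if s in goals:
            continue
        policy[s] = min(actions[s],
                        key=lambda a: sum(p * (cost[(s, a, s_next)] + V[s_next])
                                          for s_next, p in transition[(s, a)].items()))
    return V, policy
```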

Example (all actions cost 1 unless otherwise stated): running value iteration on the example SSP above, starting from the initial values shown in row n = 0 (value iteration places no restriction on the initialization):

 n    V_n(s_0)   V_n(s_1)   V_n(s_2)   V_n(s_3)   V_n(s_4)
 0    3          3          2          2          1
 1    3          3          2          2          2.8
 2    3          3          3.8        3.8        2.8
 3    4          4.8        3.8        3.8        3.52
 4    4.8        4.8        4.52       4.52       3.52
 5    5.52       5.52       4.52       4.52       3.808
20    5.99921    5.99921    4.99969    4.99969    3.99969

Comments: value iteration is a decision-theoretic algorithm, a dynamic program, and a fixed-point computation; it is the probabilistic version of the Bellman-Ford algorithm for shortest path computation (MDP_1 is the Stochastic Shortest Path problem). Time complexity: one iteration costs O(|S|² |A|); the number of iterations is poly(|S|, |A|, 1/ε, 1/(1−γ)). Space complexity: O(|S|).

Monotonicity: for all n > k, if $V_k \le_p V^*$ then $V_n \le_p V^*$ ($V_n$ is monotonic from below), and if $V_k \ge_p V^*$ then $V_n \ge_p V^*$ ($V_n$ is monotonic from above).

Changing the Search Space. Value Iteration: search in value space, then compute the resulting policy. Policy Iteration: search in policy space, then compute the resulting value.

Policy Iteration [Howard 60]: start with an arbitrary policy π_0 (an arbitrary action assigned to each state). Repeat:
Policy Evaluation: compute $V_{n+1}$, the evaluation of π_n (costly: O(|S|³)).
Policy Improvement: for all states s, compute $\pi_{n+1}(s) = \arg\min_{a \in Ap(s)} Q_{n+1}(s, a)$.
Until π_{n+1} = π_n.
Advantages: it searches a finite (policy) space as opposed to an uncountably infinite (value) space, and converges in fewer iterations; all other properties follow. Modified Policy Iteration: approximate the evaluation step by value iteration using a fixed policy (see the sketches below).
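
A sketch of policy iteration under the same assumed data layout; for simplicity the evaluation step here is done iteratively rather than by an exact linear solve, and every policy encountered is assumed to be proper (otherwise the inner loop would not terminate):

```python
def policy_iteration(states, goals, actions, transition, cost, epsilon=1e-6):
    """Policy iteration for a cost-minimization SSP: alternate policy evaluation
    with greedy policy improvement until the policy no longer changes."""
    # arbitrary initial policy: first applicable action in every non-goal state
    policy = {s: actions[s][0] for s in states if s not in goals}

    def q_value(s, a, V):
        return sum(p * (cost[(s, a, s_next)] + V[s_next])
                   for s_next, p in transition[(s, a)].items())

    while True:
        # policy evaluation (iterative; an exact linear solve also works)
        V = {s: 0.0 for s in states}
        while True:
            V_new = {s: 0.0 if s in goals else q_value(s, policy[s], V)
                     for s in states}
            residual = max(abs(V_new[s] - V[s]) for s in states)
            V = V_new
            if residual <= epsilon:
                break
        # policy improvement: greedy with respect to the evaluated V
        new_policy = {s: min(actions[s], key=lambda a: q_value(s, a, V))
                      for s in states if s not in goals}
        if new_policy == policy:
            return V, policy
        policy = new_policy
```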

Modified Policy Iteration: start with an arbitrary policy π_0. Repeat:
Policy Evaluation: compute $V_{n+1}$, an approximate evaluation of π_n (a few backups with the policy held fixed).
Policy Improvement: for all states s, compute $\pi_{n+1}(s) = \arg\min_{a \in Ap(s)} Q_{n+1}(s, a)$.
Until π_{n+1} = π_n.
Advantage: probably the most competitive synchronous dynamic programming algorithm.
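
A sketch of the modified variant under the same assumed data layout; instead of evaluating π_n to convergence, it runs a fixed small number of fixed-policy backups. The number of backups per evaluation, k, is an illustrative parameter:

```python
def modified_policy_iteration(states, goals, actions, transition, cost,
                              k=5, max_outer_iters=1000):
    """Modified policy iteration: approximate each policy evaluation by k
    fixed-policy backups, then improve the policy greedily."""
    def q_value(s, a, V):
        return sum(p * (cost[(s, a, s_next)] + V[s_next])
                   for s_next, p in transition[(s, a)].items())

    V = {s: 0.0 for s in states}
    policy = {s: actions[s][0] for s in states if s not in goals}
    for _ in range(max_outer_iters):
        for _ in range(k):   # approximate evaluation of the current policy
            V = {s: 0.0 if s in goals else q_value(s, policy[s], V)
                 for s in states}
        new_policy = {s: min(actions[s], key=lambda a: q_value(s, a, V))
                      for s in states if s not in goals}
        if new_policy == policy:
            break
        policy = new_policy
    return V, policy
```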

Applications: stochastic games; robotics (navigation, helicopter maneuvers); finance (options, investments); communication networks; medicine (radiation planning for cancer); controlling workflows; optimizing bidding decisions in auctions; traffic flow optimization; aircraft queueing for landing and airline meal provisioning; optimizing software on mobiles; forest firefighting.