Markov Decision Processes Infinite Horizon Problems

Markov Decision Processes: Infinite Horizon Problems. Alan Fern. Based in part on slides by Craig Boutilier and Daniel Weld.

What is a solution to an MDP? MDP Planning Problem: Input: an MDP (S, A, R, T). Output: a policy that achieves an optimal value. This depends on how we define the value of a policy; there are several choices, and the solution algorithms depend on the choice. We will consider two common choices: finite-horizon value and infinite-horizon discounted value.

Discounted Infinite Horizon MDPs. Defining value as total reward (r1 + r2 + r3 + r4 + ...) is problematic with infinite horizons: many or all policies have infinite expected reward, though some MDPs are OK (e.g., those with a zero-cost absorbing state). Trick: introduce a discount factor 0 ≤ β < 1, so future rewards are discounted by β per time step. Note: V^π(s) = E[ Σ_{t=0}^∞ β^t R^t | π, s ] ≤ Σ_{t=0}^∞ β^t R_max = R_max / (1 - β), so the value is bounded. Motivation: economic? probability of death? convenience?

Notes: Discounted Infinite Horizon. Optimal policies are guaranteed to exist (Howard, 1960), i.e., there is a policy that maximizes value at each state. Furthermore, there is always an optimal stationary policy. Intuition: why would we change the action at s at a later time when there is always forever ahead? We define V*(s) to be the optimal value function; that is, V*(s) = V^π(s) for some optimal stationary policy π.

Computational Problems. Policy Evaluation: given π and an MDP, compute V^π. Policy Optimization: given an MDP, compute an optimal policy π* and V*. We'll cover two algorithms for doing this: value iteration and policy iteration.

Policy Evaluation. Value equation for a fixed policy: V^π(s) = R(s) + β Σ_{s'} T(s, π(s), s') V^π(s'), i.e., the immediate reward plus the discounted expected value of following the policy in the future. The equation can be derived from the original definition of infinite-horizon discounted value.

Policy Evaluation. Value equation for a fixed policy: V^π(s) = R(s) + β Σ_{s'} T(s, π(s), s') V^π(s'). How can we compute the value function for a fixed policy? We are given R, T, π, β and want to find V^π(s) for each s. This is a linear system with n variables and n constraints: the variables are the values of the states, V(s1), ..., V(sn), and the constraints are one value equation (above) per state. Use linear algebra to solve for V^π (e.g., matrix inverse).

Policy Evaluation via Matrix Inverse. V^π and R are n-dimensional column vectors (one element for each state), and T^π is an n×n matrix with T^π(i, j) = T(s_i, π(s_i), s_j). Then V^π = R + β T^π V^π, so (I - β T^π) V^π = R and V^π = (I - β T^π)^{-1} R.
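As a concrete illustration, here is a minimal NumPy sketch of this matrix-inverse evaluation; the function name, array shapes, and the use of a linear solve instead of forming the inverse explicitly are my own choices, not part of the slides.

```python
import numpy as np

def evaluate_policy_exact(R, T_pi, beta):
    """Solve V = R + beta * T_pi @ V for a fixed policy.

    R    : (n,) reward vector, one entry per state
    T_pi : (n, n) transition matrix under the policy, T_pi[i, j] = T(s_i, pi(s_i), s_j)
    beta : discount factor in [0, 1)
    """
    n = len(R)
    # Solve the linear system (I - beta * T_pi) V = R rather than inverting the matrix.
    return np.linalg.solve(np.eye(n) - beta * T_pi, R)
```

For example, with two states, R = np.array([1.0, 0.0]) and T_pi = np.array([[0.5, 0.5], [0.0, 1.0]]), the call evaluate_policy_exact(R, T_pi, 0.9) returns the fixed-policy values under these assumed conventions.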

Computing an Optimal Value Function. Bellman equation for the optimal value function: V*(s) = R(s) + β max_a Σ_{s'} T(s, a, s') V*(s'), i.e., the immediate reward plus the discounted expected value of the best action, assuming we get optimal value in the future. Bellman proved this is always true for an optimal value function.

Computing an Optimal Value Function. Bellman equation for the optimal value function: V*(s) = R(s) + β max_a Σ_{s'} T(s, a, s') V*(s'). How can we solve this equation for V*? The max operator makes the system non-linear, so the problem is more difficult than policy evaluation. Idea: let's pretend that we have a finite, but very, very long, horizon and apply finite-horizon value iteration, adjusting the Bellman backup to take discounting into account.

Bellman Backups (Revisited). [Diagram: to compute V^{k+1}(s), first compute the expectation over next states s1, ..., s4 (with values V^k) for each action, e.g., a1 with probabilities 0.4/0.6 and a2 with probabilities 0.7/0.3, then take the max over actions.] V^{k+1}(s) = R(s) + β max_a Σ_{s'} T(s, a, s') V^k(s').

Value Iteration. We can compute an optimal policy using value iteration based on Bellman backups, just like finite-horizon problems (but including the discount term). V^0(s) = 0 ;; could also initialize to R(s). V^k(s) = R(s) + β max_a Σ_{s'} T(s, a, s') V^{k-1}(s'). Will it converge to the optimal value function as k gets large? Yes: lim_{k→∞} V^k = V*. Why?
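To make the iteration concrete, here is a minimal NumPy sketch of discounted value iteration; the transition-tensor layout, the function name, and the max-norm stopping test are illustrative assumptions, not from the slides.

```python
import numpy as np

def value_iteration(R, T, beta, eps=1e-6):
    """Infinite-horizon discounted value iteration (a sketch).

    R    : (n,) reward vector
    T    : (n_actions, n, n) transition tensor, T[a, i, j] = T(s_i, a, s_j)
    beta : discount factor in [0, 1)
    eps  : stopping threshold on the max-norm change ||V^k - V^{k-1}||
    """
    n = R.shape[0]
    V = np.zeros(n)                      # V^0 = 0 (could also start from R)
    while True:
        # Bellman backup: Q[a, s] = R(s) + beta * sum_{s'} T(s, a, s') V(s')
        Q = R + beta * T @ V             # shape (n_actions, n)
        V_new = Q.max(axis=0)            # max over actions
        if np.max(np.abs(V_new - V)) <= eps:
            return V_new
        V = V_new
```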

Convergence of Value Iteration. Bellman Backup Operator: define B to be an operator that takes a value function V as input and returns a new value function after a Bellman backup: B[V](s) = R(s) + β max_a Σ_{s'} T(s, a, s') V(s'). Value iteration is just the iterative application of B: V^0 = 0, V^k = B[V^{k-1}].

Convergence: Fixed Point Property. Bellman equation for the optimal value function: V*(s) = R(s) + β max_a Σ_{s'} T(s, a, s') V*(s'). Fixed Point Property: the optimal value function is a fixed point of the Bellman backup operator B, that is, B[V*] = V*, where B[V](s) = R(s) + β max_a Σ_{s'} T(s, a, s') V(s').

Convergence: Contraction Property. Let ||V|| denote the max-norm of V, which returns the maximum absolute value of the vector; e.g., ||(0.1, -100, 5, 6)|| = 100. B is a contraction operator w.r.t. the max-norm: for any V and V', ||B[V] - B[V']|| ≤ β ||V - V'||. (You will prove this.) That is, applying B to any two value functions causes them to get closer together in the max-norm sense!

Convergence. Using the properties of B we can prove convergence of value iteration. Proof: 1. For any V: ||V* - B[V]|| = ||B[V*] - B[V]|| ≤ β ||V* - V||. 2. So applying the Bellman backup to any value function V brings us closer to V* by a constant factor β: ||V* - V^{k+1}|| = ||V* - B[V^k]|| ≤ β ||V* - V^k||. 3. This means that ||V* - V^k|| ≤ β^k ||V* - V^0||. 4. Thus lim_{k→∞} V^k = V*.

Value Iteration: Stopping Condition. We want to stop when we can guarantee that the value function is near optimal. Key property (not hard to prove): if ||V^k - V^{k-1}|| ≤ ε then ||V^k - V*|| ≤ εβ / (1 - β). Continue iterating until ||V^k - V^{k-1}|| ≤ ε, and select ε small enough for the desired error guarantee.

How to Act. Given a V^k from value iteration that closely approximates V*, what should we use as our policy? Use the greedy policy (one-step lookahead): greedy[V^k](s) = argmax_a Σ_{s'} T(s, a, s') V^k(s'). Note that the value of the greedy policy may not be exactly equal to V^k. Why?
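A minimal sketch of this one-step lookahead, reusing the transition-tensor layout assumed above; since R(s) and β do not affect the argmax over actions, they are omitted.

```python
import numpy as np

def greedy_policy(V, T):
    """One-step lookahead greedy policy w.r.t. a value function V (a sketch).

    V : (n,) value function, e.g. from value iteration
    T : (n_actions, n, n) transition tensor
    Returns an (n,) array of action indices.
    """
    # Q[a, s] = sum_{s'} T(s, a, s') V(s'); R(s) and beta do not change the argmax.
    Q = T @ V
    return Q.argmax(axis=0)
```

Under these conventions, greedy_policy(value_iteration(R, T, 0.95), T) would extract a near-optimal policy from an approximate value function.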

How to Act. Use the greedy policy (one-step lookahead): greedy[V^k](s) = argmax_a Σ_{s'} T(s, a, s') V^k(s'). We care about the value of the greedy policy, which we denote by V^g; this is how good the greedy policy will be in practice. How close is V^g to V*?

Value of the Greedy Policy. greedy[V^k](s) = argmax_a Σ_{s'} T(s, a, s') V^k(s'). Define V^g to be the value of this greedy policy; this is likely not the same as V^k. Property: if ||V^k - V*|| ≤ λ then ||V^g - V*|| ≤ 2λβ / (1 - β). Thus, V^g is not too far from optimal if V^k is close to optimal. Our previous stopping condition allows us to bound λ based on ||V^{k+1} - V^k||. Set the stopping condition so that ||V^g - V*|| ≤ Δ. How?

Goal: ||V^g - V*|| ≤ Δ. Property: if ||V^k - V*|| ≤ λ then ||V^g - V*|| ≤ 2λβ / (1 - β). Property: if ||V^k - V^{k-1}|| ≤ ε then ||V^k - V*|| ≤ εβ / (1 - β). Answer: if ||V^k - V^{k-1}|| ≤ (1 - β)² Δ / (2β²) then ||V^g - V*|| ≤ Δ.
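A short worked derivation of that threshold, chaining the two properties above (my own expansion of the slide's answer):

```latex
\|V^k - V^{k-1}\| \le \epsilon
  \;\Rightarrow\; \|V^k - V^*\| \le \frac{\epsilon\beta}{1-\beta} =: \lambda
  \;\Rightarrow\; \|V^g - V^*\| \le \frac{2\lambda\beta}{1-\beta}
  = \frac{2\beta^2}{(1-\beta)^2}\,\epsilon .
\qquad \text{Choosing } \epsilon = \frac{(1-\beta)^2}{2\beta^2}\,\Delta
  \text{ gives } \|V^g - V^*\| \le \Delta .
```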

Policy Evaluation Revisited. Sometimes policy evaluation is expensive due to the matrix operations. Can we have an iterative algorithm, like value iteration, for policy evaluation? Idea: given a policy π and an MDP M, create a new MDP M[π] that is identical to M, except that in each state s we only allow the single action π(s). What is the optimal value function V* for M[π]? Since the only valid policy for M[π] is π, V* = V^π.

Policy Evaluation Revisited. Running VI on M[π] will converge to V* = V^π. What does the Bellman backup look like here? The Bellman backup now only considers one action in each state, so there is no max; we are effectively applying a backup restricted by π. Restricted Bellman Backup: B_π[V](s) = R(s) + β Σ_{s'} T(s, π(s), s') V(s').

Iterative Policy Evaluation. Running VI on M[π] is equivalent to iteratively applying the restricted Bellman backup. Iterative Policy Evaluation: V^0 = 0, V^k = B_π[V^{k-1}]. Convergence: lim_{k→∞} V^k = V^π. Often V^k becomes close to V^π for small k.
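A minimal sketch of iterative policy evaluation under the same assumed array layout; representing the policy as an array of action indices is my convention, not the slides'.

```python
import numpy as np

def evaluate_policy_iterative(R, T, pi, beta, eps=1e-6):
    """Iterative policy evaluation via restricted Bellman backups (a sketch).

    R    : (n,) reward vector
    T    : (n_actions, n, n) transition tensor
    pi   : (n,) array of action indices, pi[s] = action taken in state s
    beta : discount factor in [0, 1)
    """
    n = R.shape[0]
    states = np.arange(n)
    T_pi = T[pi, states, :]              # (n, n): row s is T(s, pi(s), .)
    V = np.zeros(n)                      # V^0 = 0
    while True:
        V_new = R + beta * T_pi @ V      # restricted backup B_pi[V]
        if np.max(np.abs(V_new - V)) <= eps:
            return V_new
        V = V_new
```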

Optimization via Policy Iteration. Policy iteration uses policy evaluation as a subroutine for optimization; it alternates steps of policy evaluation and policy improvement. 1. Choose a random policy π. 2. Loop: (a) Evaluate V^π. (b) π' = ImprovePolicy(V^π), which, given V^π, returns a strictly better policy if π isn't optimal. (c) Replace π with π'. Until no improving action is possible at any state.

Policy Improvement. Given V^π, how can we compute a policy π' that is strictly better than a sub-optimal π? Idea: given a state s, take the action that looks best assuming that we follow policy π thereafter; that is, assume the next state s' has value V^π(s'). For each s in S, set π'(s) = argmax_a Σ_{s'} T(s, a, s') V^π(s'). Proposition: V^π' ≥ V^π, with strict inequality for a sub-optimal π.

For any two value functions V1 and V2, we write V1 ≥ V2 to indicate that for all states s, V1(s) ≥ V2(s). π'(s) = argmax_a Σ_{s'} T(s, a, s') V^π(s'). Proposition: V^π' ≥ V^π, with strict inequality for a sub-optimal π. Useful properties for the proof: 1) V^π = B_π[V^π]. 2) B[V^π] = B_π'[V^π] ;; by the definition of π'. 3) (Monotonicity) For any V1, V2 and π, if V1 ≥ V2 then B_π[V1] ≥ B_π[V2].

π'(s) = argmax_a Σ_{s'} T(s, a, s') V^π(s'). Proposition: V^π' ≥ V^π, with strict inequality for a sub-optimal π. Proof (first part, non-strict inequality): We know that V^π = B_π[V^π] ≤ B[V^π] = B_π'[V^π], so we have that V^π ≤ B_π'[V^π]. Now by monotonicity we get B_π'[V^π] ≤ B_π'^2[V^π], where B_π'^k denotes k applications of B_π'. We can continue and derive that, in general, for any k, B_π'^k[V^π] ≤ B_π'^{k+1}[V^π], which also implies that V^π ≤ B_π'^k[V^π] for any k. Thus V^π ≤ lim_{k→∞} B_π'^k[V^π] = V^π'.

π'(s) = argmax_a Σ_{s'} T(s, a, s') V^π(s'). Proposition: V^π' ≥ V^π, with strict inequality for a sub-optimal π. Proof (part two, strict inequality): We want to show that if π is sub-optimal then V^π' > V^π. We prove the contrapositive: if it is not the case that V^π' > V^π, then π is optimal. Since we already showed that V^π' ≥ V^π, the condition of the contrapositive, not(V^π' > V^π), is equivalent to V^π' = V^π. Now assume that V^π' = V^π. Combining this with V^π' = B_π'[V^π'] yields V^π = B_π'[V^π] = B[V^π]. Thus V^π satisfies the Bellman equation and must be optimal.

Optimization via Policy Iteration. 1. Choose a random policy π. 2. Loop: (a) Evaluate V^π. (b) For each s in S, set π'(s) = argmax_a Σ_{s'} T(s, a, s') V^π(s'). (c) Replace π with π'. Until no improving action is possible at any state. Proposition: V^π' ≥ V^π, with strict inequality for a sub-optimal π. Policy iteration goes through a sequence of improving policies.
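Putting the pieces together, here is a minimal sketch of exact policy iteration under the same assumed layout (exact linear-solve evaluation plus greedy improvement); the all-zeros initial policy and argmax tie-breaking are my choices, not prescribed by the slides.

```python
import numpy as np

def policy_iteration(R, T, beta):
    """Exact policy iteration (a sketch).

    R    : (n,) reward vector
    T    : (n_actions, n, n) transition tensor
    beta : discount factor in [0, 1)
    Returns (pi, V) for an optimal policy and its value function.
    """
    n_actions, n, _ = T.shape
    states = np.arange(n)
    pi = np.zeros(n, dtype=int)              # arbitrary initial policy
    while True:
        # (a) Policy evaluation: solve (I - beta * T_pi) V = R exactly.
        T_pi = T[pi, states, :]
        V = np.linalg.solve(np.eye(n) - beta * T_pi, R)
        # (b) Policy improvement: one-step lookahead greedy policy.
        pi_new = (T @ V).argmax(axis=0)
        # (c) Stop when no state changes its action.
        if np.array_equal(pi_new, pi):
            return pi, V
        pi = pi_new
```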

Policy Iteration: Convergence. Convergence is assured in a finite number of iterations: since there are finitely many policies and each step improves the value, policy iteration must converge to an optimal policy. It gives the exact value of the optimal policy.

Policy Iteration Complexity. Each iteration runs in time polynomial in the number of states and actions. There are at most |A|^n policies and PI never repeats a policy, so there are at most an exponential number of iterations; this is not a very good complexity bound. Empirically O(n) iterations are required, and often it seems like O(1). Challenge: try to generate an MDP that requires more than n iterations. There are recent polynomial bounds.

Value Iteration vs. Policy Iteration. Which is faster, VI or PI? It depends on the problem. VI takes more iterations than PI, but PI requires more time per iteration: PI must perform policy evaluation on each iteration, which involves solving a linear system. VI is easier to implement since it does not require the policy evaluation step (but see the next slide). We will see that both algorithms serve as inspiration for more advanced algorithms.

Modified Policy Iteration. Modified policy iteration replaces the exact policy evaluation step with an inexact iterative evaluation: it uses a small number of restricted Bellman backups for evaluation. This avoids the expensive policy evaluation step and is perhaps easier to implement. It is often faster than PI and VI, and it is still guaranteed to converge under mild assumptions on the starting points.

Modified Policy Iteration. 1. Choose an initial value function V. 2. Loop: (a) For each s in S, set π(s) = argmax_a Σ_{s'} T(s, a, s') V(s'). (b) Partial policy evaluation: repeat K times: V = B_π[V] (approximate evaluation). Until the change in V is minimal.
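A minimal sketch of modified policy iteration in the same assumed layout; the value of K, the stopping tolerance, and the zero initialization are illustrative assumptions.

```python
import numpy as np

def modified_policy_iteration(R, T, beta, K=10, eps=1e-6):
    """Modified policy iteration (a sketch): greedy improvement plus
    K restricted Bellman backups instead of exact policy evaluation.

    R    : (n,) reward vector
    T    : (n_actions, n, n) transition tensor
    beta : discount factor in [0, 1)
    K    : number of partial-evaluation backups per iteration
    """
    n = R.shape[0]
    states = np.arange(n)
    V = np.zeros(n)                          # initial value function
    while True:
        # (a) Greedy policy w.r.t. the current value function.
        pi = (T @ V).argmax(axis=0)
        T_pi = T[pi, states, :]
        # (b) Partial policy evaluation: K restricted backups B_pi.
        V_new = V
        for _ in range(K):
            V_new = R + beta * T_pi @ V_new
        # Stop when the change in V is minimal.
        if np.max(np.abs(V_new - V)) <= eps:
            return pi, V_new
        V = V_new
```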

Recap: things you should know. What is an MDP? What is a policy? Stationary and non-stationary. What is a value function? Finite-horizon and infinite-horizon. How to evaluate policies? Finite-horizon and infinite-horizon; time/space complexity? How to optimize policies? Finite-horizon and infinite-horizon; time/space complexity? Why are they correct?