Understanding (Exact) Dynamic Programming through Bellman Operators


Ashwin Rao
ICME, Stanford University
January 15, 2019

Overview
1. Value Functions as Vectors
2. Bellman Operators
3. Contraction and Monotonicity
4. Policy Evaluation
5. Policy Iteration
6. Value Iteration
7. Policy Optimality

Value Functions as Vectors
Assume the State space $S$ consists of $n$ states: $\{s_1, s_2, \ldots, s_n\}$
Assume the Action space $A$ consists of $m$ actions: $\{a_1, a_2, \ldots, a_m\}$
This exposition extends easily to continuous state/action spaces too
We denote a stochastic policy as $\pi(a|s)$ (the probability of action $a$ given state $s$)
Abusing notation, a deterministic policy is denoted as $\pi(s) = a$
Consider the $n$-dimensional space $\mathbb{R}^n$, with each dimension corresponding to a state in $S$
Think of a Value Function (VF) $v: S \rightarrow \mathbb{R}$ as a vector in this space, with coordinates $[v(s_1), v(s_2), \ldots, v(s_n)]$
The Value Function for a policy $\pi$ is denoted as $v_\pi: S \rightarrow \mathbb{R}$
The Optimal VF is denoted as $v_*: S \rightarrow \mathbb{R}$, such that for any $s \in S$, $v_*(s) = \max_\pi v_\pi(s)$
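As a concrete illustration (a minimal sketch of my own, not part of the slides), a VF over an enumerated finite state space is literally just a vector in $\mathbb{R}^n$, assuming numpy and an arbitrary toy state set:

```python
# A minimal sketch (assumed toy example, not from the slides): a Value Function
# over a finite state space S = {s_1, ..., s_n} stored as a vector in R^n.
import numpy as np

states = ["s1", "s2", "s3", "s4"]             # enumerate S; here n = 4
index = {s: i for i, s in enumerate(states)}  # state -> coordinate in R^n

v = np.zeros(len(states))                     # the VF as the vector [v(s1), ..., v(sn)]
v[index["s3"]] = 2.5                          # reading/writing v(s) is just indexing
print(v)                                      # the coordinates [v(s1), v(s2), v(s3), v(s4)]
```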

Some more notation
Denote $R^a_s$ as the Expected Reward upon taking action $a$ in state $s$
Denote $P^a_{s,s'}$ as the probability of transition $s \rightarrow s'$ upon taking action $a$
Define $R_\pi(s) = \sum_{a \in A} \pi(a|s) \cdot R^a_s$ and $P_\pi(s, s') = \sum_{a \in A} \pi(a|s) \cdot P^a_{s,s'}$
Denote $R_\pi$ as the vector $[R_\pi(s_1), R_\pi(s_2), \ldots, R_\pi(s_n)]$
Denote $P_\pi$ as the matrix $[P_\pi(s_i, s_{i'})]$, $1 \le i, i' \le n$
Denote $\gamma$ as the MDP discount factor
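To make the $R_\pi$ / $P_\pi$ construction concrete, here is a minimal sketch (my own toy example, not from the slides) that builds them from arrays with the conventions R[s, a] = $R^a_s$, P[s, a, s'] = $P^a_{s,s'}$ and pi[s, a] = $\pi(a|s)$:

```python
# Minimal sketch (assumed toy MDP, not from the slides) of building R_pi and P_pi.
# Conventions: R[s, a] = R^a_s, P[s, a, s'] = P^a_{s,s'}, pi[s, a] = pi(a|s).
import numpy as np

rng = np.random.default_rng(0)
n, m = 4, 3                                   # n states, m actions
R = rng.normal(size=(n, m))                   # expected rewards R^a_s
P = rng.random(size=(n, m, n))
P /= P.sum(axis=2, keepdims=True)             # each P[s, a, :] sums to 1
pi = np.full((n, m), 1.0 / m)                 # uniform stochastic policy pi(a|s)

R_pi = (pi * R).sum(axis=1)                   # R_pi(s) = sum_a pi(a|s) R^a_s, shape (n,)
P_pi = np.einsum("sa,sat->st", pi, P)         # P_pi(s, s') = sum_a pi(a|s) P^a_{s,s'}, shape (n, n)

assert np.allclose(P_pi.sum(axis=1), 1.0)     # P_pi is a valid transition matrix
```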

Bellman Operators $B_\pi$ and $B_*$
We define operators that transform a VF vector into another VF vector
The Bellman Policy Operator $B_\pi$ (for policy $\pi$) operating on a VF vector $v$:
$B_\pi v = R_\pi + \gamma P_\pi v$
$B_\pi$ is a linear operator with fixed point $v_\pi$, meaning $B_\pi v_\pi = v_\pi$
The Bellman Optimality Operator $B_*$ operating on a VF vector $v$:
$(B_* v)(s) = \max_a \{R^a_s + \gamma \sum_{s' \in S} P^a_{s,s'} \, v(s')\}$
$B_*$ is a non-linear operator with fixed point $v_*$, meaning $B_* v_* = v_*$
Define a function $G$ mapping a VF $v$ to a deterministic greedy policy $G(v)$ as follows:
$G(v)(s) = \arg\max_a \{R^a_s + \gamma \sum_{s' \in S} P^a_{s,s'} \, v(s')\}$
$B_{G(v)} v = B_* v$ for any VF $v$ (the policy $G(v)$ achieves the max in $B_*$)
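The two operators and the greedy map $G$ translate directly into code. The sketch below (my own, under the same assumed toy-MDP array conventions as above) also checks the stated identity $B_{G(v)} v = B_* v$ on a random VF:

```python
# Minimal sketch (assumed toy MDP) of B_pi, B_* and the greedy map G.
import numpy as np

rng = np.random.default_rng(0)
n, m, gamma = 4, 3, 0.9
R = rng.normal(size=(n, m))                   # R^a_s
P = rng.random(size=(n, m, n))
P /= P.sum(axis=2, keepdims=True)             # P^a_{s,s'}

def B_pi(v, pi):
    """Bellman Policy Operator: B_pi v = R_pi + gamma * P_pi v."""
    R_pi = (pi * R).sum(axis=1)
    P_pi = np.einsum("sa,sat->st", pi, P)
    return R_pi + gamma * P_pi @ v

def B_star(v):
    """Bellman Optimality Operator: (B_* v)(s) = max_a {R^a_s + gamma sum_s' P^a_{s,s'} v(s')}."""
    return (R + gamma * P @ v).max(axis=1)    # R + gamma * P @ v has shape (n, m)

def G(v):
    """Greedy map: G(v)(s) = argmax_a {R^a_s + gamma sum_s' P^a_{s,s'} v(s')}."""
    return (R + gamma * P @ v).argmax(axis=1)

v = rng.normal(size=n)
greedy = np.eye(m)[G(v)]                      # deterministic policy G(v) encoded as one-hot pi(a|s)
assert np.allclose(B_pi(v, greedy), B_star(v))  # B_{G(v)} v = B_* v
```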

Contraction and Monotonicity of Operators
Both $B_\pi$ and $B_*$ are $\gamma$-contraction operators in the $L^\infty$ norm, meaning: for any two VFs $v_1$ and $v_2$,
$\|B_\pi v_1 - B_\pi v_2\|_\infty \le \gamma \|v_1 - v_2\|_\infty$
$\|B_* v_1 - B_* v_2\|_\infty \le \gamma \|v_1 - v_2\|_\infty$
So we can invoke the Contraction Mapping Theorem to claim a unique fixed point for each operator
We use the notation $v_1 \le v_2$, for any two VFs $v_1, v_2$, to mean $v_1(s) \le v_2(s)$ for all $s \in S$
Also, both $B_\pi$ and $B_*$ are monotonic, meaning: for any two VFs $v_1$ and $v_2$,
$v_1 \le v_2 \Rightarrow B_\pi v_1 \le B_\pi v_2$
$v_1 \le v_2 \Rightarrow B_* v_1 \le B_* v_2$
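A quick numerical check of the $\gamma$-contraction claim, under the same assumed toy MDP (this only illustrates the inequality on random VF pairs; it is of course not a proof):

```python
# Minimal sketch (assumed toy MDP): checking the L-infinity contraction property
# of B_* on random pairs of VFs. Illustration only, not a proof.
import numpy as np

rng = np.random.default_rng(1)
n, m, gamma = 4, 3, 0.9
R = rng.normal(size=(n, m))
P = rng.random(size=(n, m, n))
P /= P.sum(axis=2, keepdims=True)

def B_star(v):
    return (R + gamma * P @ v).max(axis=1)

for _ in range(1000):
    v1, v2 = rng.normal(size=n), rng.normal(size=n)
    lhs = np.max(np.abs(B_star(v1) - B_star(v2)))   # ||B_* v1 - B_* v2||_inf
    rhs = gamma * np.max(np.abs(v1 - v2))           # gamma * ||v1 - v2||_inf
    assert lhs <= rhs + 1e-12
```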

Policy Evaluation
$B_\pi$ satisfies the conditions of the Contraction Mapping Theorem
So $B_\pi$ has a unique fixed point $v_\pi$, meaning $B_\pi v_\pi = v_\pi$
This is a succinct representation of the Bellman Expectation Equation
Starting with any VF $v$ and repeatedly applying $B_\pi$, we will reach $v_\pi$:
$\lim_{N \to \infty} B_\pi^N v = v_\pi$ for any VF $v$
This is a succinct representation of the Policy Evaluation Algorithm
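In code, Policy Evaluation is just repeated application of $B_\pi$ until the iterates stop changing; for a finite MDP one can also solve the linear system $v_\pi = R_\pi + \gamma P_\pi v_\pi$ directly, which the sketch below (my own, under the assumed toy-MDP conventions) uses as a cross-check:

```python
# Minimal sketch (assumed toy MDP): Policy Evaluation by iterating B_pi, checked
# against the direct linear solve of v_pi = R_pi + gamma * P_pi v_pi.
import numpy as np

rng = np.random.default_rng(0)
n, m, gamma = 4, 3, 0.9
R = rng.normal(size=(n, m))
P = rng.random(size=(n, m, n))
P /= P.sum(axis=2, keepdims=True)
pi = np.full((n, m), 1.0 / m)

R_pi = (pi * R).sum(axis=1)
P_pi = np.einsum("sa,sat->st", pi, P)

v = np.zeros(n)                               # any starting VF works
for _ in range(10_000):
    v_next = R_pi + gamma * P_pi @ v          # one application of B_pi
    if np.max(np.abs(v_next - v)) < 1e-12:
        break
    v = v_next

v_direct = np.linalg.solve(np.eye(n) - gamma * P_pi, R_pi)
assert np.allclose(v, v_direct)               # both are the unique fixed point v_pi
```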

Policy Improvement
Let $\pi_k$ and $v_{\pi_k}$ denote the Policy and the VF for that Policy in iteration $k$ of Policy Iteration
The Policy Improvement Step is $\pi_{k+1} = G(v_{\pi_k})$, i.e. the deterministic greedy policy
Earlier we argued that $B_* v = B_{G(v)} v$ for any VF $v$. Therefore,
$B_* v_{\pi_k} = B_{G(v_{\pi_k})} v_{\pi_k} = B_{\pi_{k+1}} v_{\pi_k}$   (1)
We also know from the operator definitions that $B_* v \ge B_\pi v$ for all $\pi$ and all $v$   (2)
Combining (1) and (2), and using the fact that $v_{\pi_k}$ is the fixed point of $B_{\pi_k}$, we get:
$B_{\pi_{k+1}} v_{\pi_k} = B_* v_{\pi_k} \ge B_{\pi_k} v_{\pi_k} = v_{\pi_k}$
Monotonicity of $B_{\pi_{k+1}}$, applied repeatedly to this inequality, implies:
$B_{\pi_{k+1}}^N v_{\pi_k} \ge \ldots \ge B_{\pi_{k+1}}^2 v_{\pi_k} \ge B_{\pi_{k+1}} v_{\pi_k} \ge v_{\pi_k}$
Therefore, $v_{\pi_{k+1}} = \lim_{N \to \infty} B_{\pi_{k+1}}^N v_{\pi_k} \ge v_{\pi_k}$
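The chain of inequalities can be observed numerically: evaluate $\pi_k$, apply one greedy improvement, evaluate $\pi_{k+1}$, and check componentwise that $v_{\pi_{k+1}} \ge B_{\pi_{k+1}} v_{\pi_k} \ge v_{\pi_k}$. A sketch under the same assumed toy MDP:

```python
# Minimal sketch (assumed toy MDP): numerically observing the Policy Improvement
# inequalities v_{pi_{k+1}} >= B_{pi_{k+1}} v_{pi_k} >= v_{pi_k} (componentwise).
import numpy as np

rng = np.random.default_rng(0)
n, m, gamma = 4, 3, 0.9
R = rng.normal(size=(n, m))
P = rng.random(size=(n, m, n))
P /= P.sum(axis=2, keepdims=True)

def evaluate(pi):
    """Solve v_pi = R_pi + gamma * P_pi v_pi directly."""
    R_pi = (pi * R).sum(axis=1)
    P_pi = np.einsum("sa,sat->st", pi, P)
    return np.linalg.solve(np.eye(n) - gamma * P_pi, R_pi)

pi_k = np.full((n, m), 1.0 / m)               # start from the uniform policy pi_k
v_k = evaluate(pi_k)                          # v_{pi_k}

pi_k1 = np.eye(m)[(R + gamma * P @ v_k).argmax(axis=1)]   # pi_{k+1} = G(v_{pi_k}), one-hot
B_k1_vk = (pi_k1 * R).sum(axis=1) + gamma * np.einsum("sa,sat->st", pi_k1, P) @ v_k
v_k1 = evaluate(pi_k1)                        # v_{pi_{k+1}}

assert np.all(B_k1_vk >= v_k - 1e-10)         # B_{pi_{k+1}} v_{pi_k} >= v_{pi_k}
assert np.all(v_k1 >= B_k1_vk - 1e-10)        # v_{pi_{k+1}} >= B_{pi_{k+1}} v_{pi_k}
```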

Policy Iteration
We have shown that in iteration $k+1$ of Policy Iteration, $v_{\pi_{k+1}} \ge v_{\pi_k}$
If $v_{\pi_{k+1}} = v_{\pi_k}$, the above inequalities would hold as equalities
So this would mean $B_* v_{\pi_k} = v_{\pi_k}$
But $B_*$ has a unique fixed point $v_*$
So this would mean $v_{\pi_k} = v_*$
Thus, at each iteration, Policy Iteration either strictly improves the VF or has already achieved the Optimal VF $v_*$
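Put together, Policy Iteration alternates exact evaluation and greedy improvement until the greedy policy (and hence the VF) stops changing. A sketch under the same assumed toy MDP:

```python
# Minimal sketch (assumed toy MDP): Policy Iteration = repeat {evaluate, improve}
# until the greedy policy no longer changes.
import numpy as np

rng = np.random.default_rng(0)
n, m, gamma = 4, 3, 0.9
R = rng.normal(size=(n, m))
P = rng.random(size=(n, m, n))
P /= P.sum(axis=2, keepdims=True)

def evaluate(pi_det):
    """Exact evaluation of a deterministic policy given as an action index per state."""
    R_pi = R[np.arange(n), pi_det]
    P_pi = P[np.arange(n), pi_det]            # shape (n, n)
    return np.linalg.solve(np.eye(n) - gamma * P_pi, R_pi)

pi_det = np.zeros(n, dtype=int)               # arbitrary initial deterministic policy
while True:
    v = evaluate(pi_det)                      # v_{pi_k}
    pi_next = (R + gamma * P @ v).argmax(axis=1)   # pi_{k+1} = G(v_{pi_k})
    if np.array_equal(pi_next, pi_det):       # no further improvement => v_{pi_k} = v_*
        break
    pi_det = pi_next

print("optimal VF:", v, "optimal policy:", pi_det)
```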

Value Iteration
$B_*$ satisfies the conditions of the Contraction Mapping Theorem
So $B_*$ has a unique fixed point $v_*$, meaning $B_* v_* = v_*$
This is a succinct representation of the Bellman Optimality Equation
Starting with any VF $v$ and repeatedly applying $B_*$, we will reach $v_*$:
$\lim_{N \to \infty} B_*^N v = v_*$ for any VF $v$
This is a succinct representation of the Value Iteration Algorithm
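Value Iteration is simply repeated application of $B_*$; the sketch below (same assumed toy MDP) iterates until successive VFs agree to within a tolerance, at which point the iterate is (approximately) the fixed point $v_*$:

```python
# Minimal sketch (assumed toy MDP): Value Iteration = repeated application of B_*.
import numpy as np

rng = np.random.default_rng(0)
n, m, gamma = 4, 3, 0.9
R = rng.normal(size=(n, m))
P = rng.random(size=(n, m, n))
P /= P.sum(axis=2, keepdims=True)

def B_star(v):
    return (R + gamma * P @ v).max(axis=1)

v = np.zeros(n)                               # any starting VF works
while True:
    v_next = B_star(v)
    if np.max(np.abs(v_next - v)) < 1e-12:    # ||B_* v - v||_inf ~ 0  =>  v ~ v_*
        break
    v = v_next

print("approximate v_*:", v_next)
```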

Greedy Policy from the Optimal VF is an Optimal Policy
Earlier we argued that $B_{G(v)} v = B_* v$ for any VF $v$. Therefore, $B_{G(v_*)} v_* = B_* v_*$
But $v_*$ is the fixed point of $B_*$, meaning $B_* v_* = v_*$. Therefore, $B_{G(v_*)} v_* = v_*$
But we know that $B_{G(v_*)}$ has a unique fixed point $v_{G(v_*)}$. Therefore, $v_* = v_{G(v_*)}$
This says that simply following the deterministic greedy policy $G(v_*)$ (created from the Optimal VF $v_*$) in fact achieves the Optimal VF $v_*$
In other words, $G(v_*)$ is an Optimal (Deterministic) Policy
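This last argument is also easy to verify numerically: take the $v_*$ produced by Value Iteration, form the greedy policy $G(v_*)$, evaluate it exactly, and check that its VF matches $v_*$. A sketch under the same assumed toy MDP:

```python
# Minimal sketch (assumed toy MDP): the greedy policy G(v_*) attains v_*.
import numpy as np

rng = np.random.default_rng(0)
n, m, gamma = 4, 3, 0.9
R = rng.normal(size=(n, m))
P = rng.random(size=(n, m, n))
P /= P.sum(axis=2, keepdims=True)

def B_star(v):
    return (R + gamma * P @ v).max(axis=1)

v_star = np.zeros(n)                          # obtain (an accurate approximation of) v_* by Value Iteration
for _ in range(10_000):
    v_star = B_star(v_star)

greedy = (R + gamma * P @ v_star).argmax(axis=1)          # G(v_*) as an action index per state
R_g = R[np.arange(n), greedy]
P_g = P[np.arange(n), greedy]
v_greedy = np.linalg.solve(np.eye(n) - gamma * P_g, R_g)  # exact v_{G(v_*)}

assert np.allclose(v_greedy, v_star, atol=1e-8)           # v_{G(v_*)} = v_*
```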