Algorithms for MDPs and Their Convergence

MS&E338 Reinforcement Learning Lecture 2 - April 4, 2018

Algorithms for MDPs and Their Convergence

Lecturer: Ben Van Roy    Scribes: Matthew Creme and Kristen Kessel

1 Bellman operators

Recall from last lecture that we defined two Bellman operators. The first is

$$(TV)(s) = \max_{a \in A} \Big\{ R(s,a) + \sum_{s' \in S} P_{sa}(s') V(s') \Big\},$$

and the second,

$$(T_\pi V)(s) = \sum_{a \in A} \pi(a \mid s) \Big[ R(s,a) + \sum_{s' \in S} P_{sa}(s') V(s') \Big],$$

is associated with a policy $\pi$. We refer to $V^* = T V^*$ and $V_\pi = T_\pi V_\pi$ as the Bellman equations. The Bellman operators $T$ and $T_\pi$ satisfy two properties: monotonicity and contraction.

Proposition 1. For all $V$, $\bar V$, and $\pi$ with $V \le \bar V$: $TV \le T\bar V$ and $T_\pi V \le T_\pi \bar V$.

Proof. We refer the reader to Lecture 1 for this proof.

In proving the second property, that $T$ and $T_\pi$ are contraction operators, we first introduce two key concepts: maximum survival time and weighted max-norm.

Definition 2. The maximum survival time of state $s$, denoted $\tau(s)$, is defined as

$$\tau(s) = \max_\pi E_\pi\Big[\sum_{t=0}^{\infty} \mathbb{1}(s_t \in S) \;\Big|\; s_0 = s\Big] < \infty,$$

the largest expected number of time steps for which the process remains in $S$ before terminating. Intuitively, $\tau(s)$ can be thought of as how far we must look ahead in order to plan well in the worst case, given that we start at state $s$. Note that we can equivalently express $\tau(s)$ as

$$\tau(s) = 1 + \max_{a \in A} \Big\{ \sum_{s' \in S} P_{sa}(s') \tau(s') \Big\}.$$

We use this last equation in Propositions 4 and 5.

Definition 3. The weighted max-norm of $V$ is defined as

$$\|V\|_{1/\tau} = \max_{s \in S} \frac{|V(s)|}{\tau(s)}.$$

Proposition 4. For all $V$, $\bar V$,

$$\|TV - T\bar V\|_{1/\tau} \le \alpha \|V - \bar V\|_{1/\tau},$$

where $\alpha = \max_{s \in S} \frac{\tau(s) - 1}{\tau(s)} < 1$.
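
To make the operators concrete, here is a minimal numerical sketch (not part of the original notes), assuming a hypothetical toy terminating MDP: `P[a, s, s']` holds sub-stochastic transition probabilities among the non-terminal states, with the missing mass being the per-step termination probability, and `R[s, a]` holds expected rewards. All names and numbers are illustrative.

```python
import numpy as np

# Hypothetical toy terminating MDP with 3 non-terminal states and 2 actions.
# Each row of P[a] sums to less than 1; the remaining mass is the probability
# of terminating, so the maximum survival time tau(s) is finite.
P = np.array([
    [[0.5, 0.3, 0.0], [0.1, 0.6, 0.2], [0.2, 0.2, 0.4]],   # action 0
    [[0.2, 0.2, 0.3], [0.3, 0.3, 0.3], [0.0, 0.5, 0.3]],   # action 1
])
R = np.array([[1.0, 0.5], [0.0, 2.0], [0.3, 0.1]])          # R[s, a]
n_states, n_actions = R.shape


def bellman_T(V):
    """(T V)(s) = max_a { R(s, a) + sum_{s'} P_sa(s') V(s') }."""
    Q = R + np.einsum("ast,t->sa", P, V)                    # Q[s, a]
    return Q.max(axis=1)


def bellman_T_pi(V, pi):
    """(T_pi V)(s) = sum_a pi(a|s) [ R(s, a) + sum_{s'} P_sa(s') V(s') ]."""
    Q = R + np.einsum("ast,t->sa", P, V)
    return (pi * Q).sum(axis=1)                             # pi[s, a] = pi(a|s)


def weighted_max_norm(V, tau):
    """||V||_{1/tau} = max_s |V(s)| / tau(s)."""
    return np.max(np.abs(V) / tau)
```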

Proof. First recall from Definition 2 that the maximum survival time of state $s$ satisfies

$$\tau(s) = 1 + \max_{a \in A} \Big\{ \sum_{s' \in S} P_{sa}(s') \tau(s') \Big\}.$$

Rearranging this expression yields

$$\max_{a \in A} \sum_{s' \in S} P_{sa}(s') \frac{\tau(s')}{\tau(s)} = \frac{\tau(s) - 1}{\tau(s)} \le \max_{s'' \in S} \frac{\tau(s'') - 1}{\tau(s'')} = \alpha,$$

by definition of $\alpha$. Then, by definition of $T$ and $\|\cdot\|_{1/\tau}$, we have

$$\begin{aligned}
\|TV - T\bar V\|_{1/\tau}
&= \max_{s \in S} \frac{1}{\tau(s)} \bigg| \max_{a \in A}\Big\{ R(s,a) + \sum_{s' \in S} P_{sa}(s') V(s') \Big\} - \max_{a \in A}\Big\{ R(s,a) + \sum_{s' \in S} P_{sa}(s') \bar V(s') \Big\} \bigg| \\
&\le \max_{s \in S} \frac{1}{\tau(s)} \max_{a \in A} \bigg| R(s,a) + \sum_{s' \in S} P_{sa}(s') V(s') - R(s,a) - \sum_{s' \in S} P_{sa}(s') \bar V(s') \bigg| \\
&= \max_{s \in S} \frac{1}{\tau(s)} \max_{a \in A} \bigg| \sum_{s' \in S} P_{sa}(s') \big( V(s') - \bar V(s') \big) \bigg| \\
&\le \max_{s \in S} \frac{1}{\tau(s)} \max_{a \in A} \sum_{s' \in S} P_{sa}(s') \, \big| V(s') - \bar V(s') \big| \, \frac{\tau(s')}{\tau(s')} \\
&\le \max_{s \in S} \frac{1}{\tau(s)} \max_{a \in A} \sum_{s' \in S} P_{sa}(s') \, \tau(s') \, \max_{s'' \in S} \frac{|V(s'') - \bar V(s'')|}{\tau(s'')} \\
&\le \alpha \, \|V - \bar V\|_{1/\tau},
\end{aligned}$$

where the first inequality uses $|\max_a f(a) - \max_a g(a)| \le \max_a |f(a) - g(a)|$, as required.

Proposition 5. For all $V$, $\bar V$, and $\pi$,

$$\|T_\pi V - T_\pi \bar V\|_{1/\tau} \le \alpha \|V - \bar V\|_{1/\tau},$$

where $\alpha < 1$.
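
Continuing the toy sketch above (same hypothetical `P`, `R`, `bellman_T`, and `weighted_max_norm`), Proposition 4 can be checked numerically: compute $\tau$ from its recursive characterization, form $\alpha$, and compare both sides of the contraction bound on random value functions.

```python
# tau(s) = 1 + max_a sum_{s'} P_sa(s') tau(s'); iterate to the fixed point.
# The iteration converges because the toy MDP terminates under every policy.
tau = np.ones(n_states)
for _ in range(10_000):
    tau = 1.0 + np.einsum("ast,t->sa", P, tau).max(axis=1)

alpha = np.max((tau - 1.0) / tau)                           # alpha < 1

rng = np.random.default_rng(0)
V, V_bar = rng.normal(size=(2, n_states))
lhs = weighted_max_norm(bellman_T(V) - bellman_T(V_bar), tau)
rhs = alpha * weighted_max_norm(V - V_bar, tau)
print(f"{lhs:.6f} <= {rhs:.6f}")                            # Proposition 4 in action
```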

Proof. The proof is analogous to the proof of Proposition 4 above.

Proposition 6. $T$ has a unique fixed point.

Note that Proposition 6 is true for any contraction and is known as the contraction mapping theorem, or the Banach fixed-point theorem.

Proof. Existence: Consider the sequence $\{V_i\}$ where $V_{i+1} = T V_i$, beginning with $V_0 = V$ (i.e., the sequence $V, TV, T^2V, \ldots$). By Proposition 4,

$$\begin{aligned}
\|TV - T^2V\|_{1/\tau} &\le \alpha \|V - TV\|_{1/\tau}, \\
\|T^2V - T^3V\|_{1/\tau} &\le \alpha \|TV - T^2V\|_{1/\tau} \le \alpha^2 \|V - TV\|_{1/\tau}, \\
&\;\;\vdots \\
\|T^kV - T^{k+1}V\|_{1/\tau} &\le \alpha^k \|V - TV\|_{1/\tau}.
\end{aligned}$$

Since $\alpha < 1$, the sequence $V, TV, T^2V, \ldots$ is a Cauchy sequence, and as it lies in a Euclidean space, which is complete, the sequence must converge. As the sequence converges, there exists some limit $V_\infty$ such that $V_\infty = T V_\infty$ (the contraction property makes $T$ continuous, so it maps the limit of the sequence to itself).

Uniqueness: Suppose some other $\hat V \neq V_\infty$ is a fixed point of $T$. We then have $\hat V = T \hat V$ and

$$\|V_\infty - \hat V\|_{1/\tau} = \|T V_\infty - T \hat V\|_{1/\tau} \le \alpha \|V_\infty - \hat V\|_{1/\tau},$$

where the inequality follows from Proposition 4. As $\alpha < 1$, we must have $V_\infty = \hat V$, and therefore the fixed point is unique.

Proposition 7. For all $\pi$, $T_\pi$ has a unique fixed point.

Proof. The proof is analogous to the proof of Proposition 6 above.

Note that Propositions 6 and 7 are not necessarily true for MDPs with infinite state spaces, as there can be value functions with infinite norm.

With these two properties of $T$ and $T_\pi$, monotonicity and contraction, in hand, we can next prove that both policy iteration and value iteration converge.

2 Planning Algorithms

2.1 Value Iteration

Algorithm 1 Value Iteration
  Initialize $V_0$.
  for $k = 0, 1, 2, \ldots$ do
    $V_{k+1} = T V_k$

Proposition 8. $V_k$ converges to $V^*$.

Proof. This is a corollary of Proposition 6.
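
A minimal sketch of Algorithm 1 on the same toy MDP follows; the stopping tolerance is an illustrative addition, since the algorithm as stated runs indefinitely.

```python
# Algorithm 1 (value iteration) on the toy MDP above.
V = np.zeros(n_states)
for k in range(10_000):
    V_next = bellman_T(V)
    if weighted_max_norm(V_next - V, tau) < 1e-10:          # illustrative tolerance
        break
    V = V_next
V_star = V_next                                             # approximately V*
print("V* ~=", V_star)
```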

2.2 Policy Iteration

Algorithm 2 Policy Iteration
  Initialize $\pi_0$.
  for $k = 0, 1, 2, \ldots$ do
    Solve $V_k = T_{\pi_k} V_k$
    Select $\pi_{k+1}$ s.t. $T_{\pi_{k+1}} V_k = T V_k$

Note that if $V_k = V^*$ then $\pi_{k+1}$ is optimal, and if $\pi_k$ is optimal then $V_k = V^*$. (A code sketch of this loop is given after the linear-programming formulation below.)

Proposition 9. In policy iteration, there exists $k_0$ such that $V_k = V^*$ for all $k \ge k_0$.

Proof. First, $V_k = T_{\pi_k} V_k$ by Algorithm 2, and $T_{\pi_k} V_k \le T V_k$ by definition of $T_{\pi_k}$ and $T$. Furthermore, $T V_k = T_{\pi_{k+1}} V_k$ by Algorithm 2, and therefore

$$V_k \le T_{\pi_{k+1}} V_k \le T^2_{\pi_{k+1}} V_k \le T^3_{\pi_{k+1}} V_k \le \cdots \le V_{k+1},$$

where the intermediate inequalities follow from the monotonicity of $T_{\pi_{k+1}}$, and the last inequality follows from the fact that $V_{k+1}$ is the fixed point of $T_{\pi_{k+1}}$, so the sequence $T^j_{\pi_{k+1}} V_k$ must converge to $V_{k+1}$. Thus the sequence $\{V_0, V_1, \ldots\}$ is a monotonically increasing sequence bounded above by $V^*$, and the sequence must therefore converge. As there are finitely many deterministic policies, this convergence must happen in finite time. Now suppose the sequence of $V_k$'s has converged, so that $V_{k+1} = V_k$. We then have $V_{k+1} = T_{\pi_{k+1}} V_{k+1}$ and hence $V_k = T_{\pi_{k+1}} V_k$. Since $T_{\pi_{k+1}} V_k = T V_k$, this implies $V_k = T V_k$, and therefore $V_k = V^*$. Thus the sequence $\{V_0, V_1, \ldots\}$ converges to $V^*$, as required.

2.3 Linear Programming

We look to solve the following linear program (LP), which is the dual of the LP presented in Lecture 1:

$$\min_V \; \sum_{s \in S} \rho(s) V(s) \quad \text{s.t.} \quad V \ge T V,$$

where $\rho(s)$ is the initial state distribution.

Proposition 10. If $\rho(s) > 0$ for all $s \in S$, then $V^*$ is the unique optimum.

Note that since $V^*$ does not depend on $\rho(s)$, if necessary we can enforce $\rho(s) > 0$ for all $s \in S$ by simply replacing $\rho(s)$ in the objective with another everywhere-positive function, without changing the optimal $V$.

Proof. First, $V^*$ is feasible, as $V^* = T V^*$ satisfies the constraint. Now, for any feasible $V$,

$$V \ge T V \ge T^2 V \ge \cdots \ge T^k V \ge \cdots$$

by the monotonicity of $T$. Furthermore, since $T$ is a contraction with fixed point $V^*$ (i.e., $T V^* = V^*$), the sequence $V, TV, \ldots$ must converge to $V^*$. Therefore, every feasible $V$ dominates $V^*$ (i.e., $V \ge V^*$). Thus $V^*$ minimizes the objective $\sum_{s \in S} \rho(s) V(s)$, and since $\rho(s) > 0$ for all $s$, any feasible $V \neq V^*$ has a strictly larger objective value, so $V^*$ is the unique optimal solution, as required.
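
The sketch of Algorithm 2 referred to above, again on the toy MDP. The policy evaluation step solves the linear system $V = T_\pi V$, i.e. $(I - P_\pi)V = R_\pi$, directly; representing policies as deterministic action arrays and solving with `np.linalg.solve` are implementation choices, not prescribed by the notes.

```python
# Algorithm 2 (policy iteration) on the toy MDP above; pi[s] = chosen action.
pi = np.zeros(n_states, dtype=int)                          # arbitrary initial policy
for k in range(100):
    # Policy evaluation: solve V_k = T_{pi_k} V_k, i.e. (I - P_pi) V = R_pi.
    P_pi = P[pi, np.arange(n_states), :]
    R_pi = R[np.arange(n_states), pi]
    V = np.linalg.solve(np.eye(n_states) - P_pi, R_pi)
    # Policy improvement: select pi_{k+1} with T_{pi_{k+1}} V_k = T V_k.
    Q = R + np.einsum("ast,t->sa", P, V)
    pi_next = Q.argmax(axis=1)
    if np.array_equal(pi_next, pi):                         # policy stable: V = V*
        break
    pi = pi_next
print("optimal policy:", pi, "V* ~=", V)
```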

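A sketch of the linear program in Section 2.3, using `scipy.optimize.linprog` on the same toy MDP. The constraint $V \ge TV$ is equivalent to the linear constraints $V(s) \ge R(s,a) + \sum_{s'} P_{sa}(s')V(s')$ for every pair $(s,a)$; the uniform choice of $\rho$ is illustrative.

```python
from scipy.optimize import linprog

# min_V sum_s rho(s) V(s)  s.t.  V >= T V, written in linprog's
# A_ub @ x <= b_ub form as (P_sa - e_s) . V <= -R(s, a) for every (s, a).
rho = np.full(n_states, 1.0 / n_states)                     # illustrative rho > 0
A_ub = np.vstack([P[a, s, :] - np.eye(n_states)[s]
                  for s in range(n_states) for a in range(n_actions)])
b_ub = np.array([-R[s, a]
                 for s in range(n_states) for a in range(n_actions)])
res = linprog(c=rho, A_ub=A_ub, b_ub=b_ub, bounds=[(None, None)] * n_states)
print("LP solution V* ~=", res.x)                           # matches value iteration
```
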
2.4 Asynchronous Value Iteration

Algorithm 3 Asynchronous Value Iteration
  Initialize $V_0$.
  for $k = 0, 1, 2, \ldots$ do
    Select some state $s_k \in S$
    $V(s_k) := (T V)(s_k)$

Note that unlike in Algorithm 1, where we update every state synchronously, here we update only a single state at a time.

Proposition 11. If every $s \in S$ appears in the sequence $(s_0, s_1, \ldots)$ infinitely often, then $V$ converges to $V^*$.

Proof. Homework.
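
Finally, a sketch of Algorithm 3 on the same toy MDP. Cycling through the states guarantees that every state appears infinitely often, so the condition of Proposition 11 is met; random state selection would satisfy it as well (with probability one).

```python
# Algorithm 3 (asynchronous value iteration) on the toy MDP above.
V = np.zeros(n_states)
for k in range(10_000):
    s_k = k % n_states                                      # select some state s_k
    V[s_k] = (R[s_k, :] + P[:, s_k, :] @ V).max()           # V(s_k) := (T V)(s_k)
print("asynchronous VI V ~=", V)
```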