RECURSION EQUATION FOR W(i, u)

Math 46 Lecture 8: Infinite horizon discounted reward problem.

From the last lecture: the value function of policy u for the infinite horizon problem with discount factor α and initial state i is

W(i, u) = E[ Σ_{n≥0} α^n r(X_n, u(X_n)) | X_0 = i ].

The optimal value function is W(i) = max_u W(i, u). A policy u* is optimal iff for each initial state i, W(i, u*) = W(i).

RECURSION EQUATION FOR W(i, u):

W(i, u) = r(i, u(i)) + α Σ_j T_{u(i)}(i, j) W(j, u).

DYNAMIC PROGRAMMING EQUATION FOR W(i):

W(i) = max_{a ∈ A} [ r(i, a) + α Σ_j T_a(i, j) W(j) ].

Let u* be the policy such that u*(i) is the action a which gives the max value for W(i). Then u* is optimal and W(i, u*) = W(i).

Approximation procedure for W(i): Suppose the states are S = {i_0, i_1, i_2, ...}. Let w be the column vector of values W(i):

w = (w_0, w_1, w_2, ...)^T = (W(i_0), W(i_1), W(i_2), ...)^T (superscript T means transpose).

Let w^0 = (0, 0, 0, ...)^T. Define w^{n+1} by

w^{n+1}(i) = max_a [ r(i, a) + α Σ_j T_a(i, j) w^n(j) ].

This sequence converges to W(i) (by Lecture 8). Each w^{n+1}(i) determines a maximizing action u(i) = a. Continue the approximations until this action at one stage is the same as at the next. Then check whether the policy u(i) is an optimal policy. u(i) is optimal iff W(i, u) = W(i), iff W(i, u) satisfies the dynamic programming equation for W(i), iff

W(i, u) = max_{a ∈ A} [ r(i, a) + α Σ_j T_a(i, j) W(j, u) ].

If not optimal, continue the approximation procedure. If the dynamic programming equation does hold for all states i, then u(i) is the optimal policy u* and W(i) = W(i, u) is the exact value of W(i).
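A minimal numerical sketch of this approximation procedure, assuming a finite state space and a model stored as one reward vector r[a] and one transition matrix T[a] per action, with discount alpha. The name value_iteration and the data layout are illustrative, not from the notes.

import numpy as np

def value_iteration(r, T, alpha, tol=1e-10, max_iter=10000):
    # r[a]: vector of rewards r(i, a); T[a]: transition matrix T_a; alpha: discount.
    actions = list(T)
    n = len(next(iter(r.values())))
    w = np.zeros(n)                                   # w^0 = (0, 0, ..., 0)^T
    for _ in range(max_iter):
        # one backup: q[a](i) = r(i, a) + alpha * sum_j T_a(i, j) w(j)
        q = {a: r[a] + alpha * T[a] @ w for a in actions}
        w_next = np.max(np.column_stack([q[a] for a in actions]), axis=1)
        if np.max(np.abs(w_next - w)) < tol:
            w = w_next
            break
        w = w_next
    # the maximizing action at the last stage, for each state
    policy = [max(actions, key=lambda a: q[a][i]) for i in range(n)]
    return w, policy

The notes stop when the maximizing action is unchanged from one stage to the next and then verify the dynamic programming equation exactly; the tolerance test above is a simpler surrogate for that check.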

Now compute W(i, u) from u via matrix algebra. Suppose the state space is S = {0, 1, 2, ...}.

LEMMA. Suppose u is the action policy. Let T_u(i, j) be the transition matrix for going from state i to state j according to u. Let r_u = (r(0, u(0)), r(1, u(1)), r(2, u(2)), ...)^T be the rewards when following policy u. Let X = (x_0, x_1, x_2, ...)^T = (W(0, u), W(1, u), W(2, u), ...)^T be the expected values. Then X is the solution of the matrix equation

(I - α T_u) X = r_u.

Proof. Since X(i) = W(i, u), the equation W(i, u) = r(i, u(i)) + α Σ_j T_{u(i)}(i, j) W(j, u) for all i becomes

X(i) = r(i, u(i)) + α Σ_j T_{u(i)}(i, j) X(j),

and hence

( ..., X(i), ... )^T = ( ..., r(i, u(i)) + α Σ_j T_{u(i)}(i, j) X(j), ... )^T.

Note that the vector with entries Σ_j T_{u(i)}(i, j) X(j) is the matrix product T_u X. Hence X = r_u + α T_u X, and therefore (I - α T_u) X = r_u.
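The lemma turns the evaluation of a fixed policy into one linear solve. A sketch under the same assumptions and illustrative names as above:

import numpy as np

def evaluate_policy(u, r, T, alpha):
    # Exact W(., u) from the lemma: solve (I - alpha * T_u) X = r_u.
    # u[i] is the action chosen in state i; row i of T_u is row i of T_{u(i)}.
    n = len(u)
    T_u = np.array([T[u[i]][i] for i in range(n)])
    r_u = np.array([r[u[i]][i] for i in range(n)])
    return np.linalg.solve(np.eye(n) - alpha * T_u, r_u)

Together with the previous sketch this gives the procedure of the notes: iterate until the greedy actions stabilize, evaluate that policy exactly, and test whether W(i, u) satisfies the dynamic programming equation.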

EXAMPLE. A Markov decision problem has two states i = 1, 2 and two actions a = c, d. The discount is α = 0.9. The rewards r(i, a) are

r(1, c) = 5, r(2, c) = 2, r(1, d) = 4, r(2, d) = 3.

The transition matrices for the actions c, d (rows and columns indexed by the states 1, 2) are

T_c = [ 1/2  1/2 ]        T_d = [ 1/4  3/4 ]
      [ 2/3  1/3 ]              [ 1/3  2/3 ]

Which state does c favor? Which state does d favor? Find an optimal policy u(i).

Solution. The approximation step is

w^{n+1}(i) = max_{a ∈ {c, d}} [ r(i, a) + 0.9 Σ_j T_a(i, j) w^n(j) ],

that is,

w^{n+1}(1) = max{ 5 + 0.9[ (1/2) w^n(1) + (1/2) w^n(2) ],  4 + 0.9[ (1/4) w^n(1) + (3/4) w^n(2) ] },
w^{n+1}(2) = max{ 2 + 0.9[ (2/3) w^n(1) + (1/3) w^n(2) ],  3 + 0.9[ (1/3) w^n(1) + (2/3) w^n(2) ] },

starting from w^0(1) = w^0(2) = 0.

w^1(1) = max{ 5 + 0.9[ (1/2)(0) + (1/2)(0) ],  4 + 0.9[ (1/4)(0) + (3/4)(0) ] } = max{5, 4} = 5, with action u(1) = c.
w^1(2) = max{ 2 + 0.9[ (2/3)(0) + (1/3)(0) ],  3 + 0.9[ (1/3)(0) + (2/3)(0) ] } = max{2, 3} = 3, with action u(2) = d.

w^2(1) = max{ 5 + 0.9[ (1/2)(5) + (1/2)(3) ],  4 + 0.9[ (1/4)(5) + (3/4)(3) ] } = max{ 5 + 0.9(4), 4 + 0.9(14/4) } = max{8.6, 7.15} = 8.6 = 43/5, with action u(1) = c.
w^2(2) = max{ 2 + 0.9[ (2/3)(5) + (1/3)(3) ],  3 + 0.9[ (1/3)(5) + (2/3)(3) ] } = max{ 2 + 0.9(13/3), 3 + 0.9(11/3) } = max{5.9, 6.3} = 6.3, with action u(2) = d.

For both w^1 and w^2, u(1) and u(2) are the same. Hence we test to see if u(1) = c, u(2) = d is an optimal policy. For this u,

T_u = [ 1/2  1/2 ]
      [ 1/3  2/3 ]

The first row of T_u(i, j) is from T_c, since u(1) = c. The second row is from T_d, since u(2) = d.

The rewards r(i, a) are r(1, c) = 5, r(2, c) = 2, r(1, d) = 4, r(2, d) = 3. The policy is u(1) = c, u(2) = d. Thus r_u = (r(1, u(1)), r(2, u(2)))^T = ?

r_u = (r(1, u(1)), r(2, u(2)))^T = (r(1, c), r(2, d))^T = (5, 3)^T.

The equation (I - 0.9 T_u) X = r_u becomes

[ 1 - 0.9(1/2)    -0.9(1/2)   ] [ x ]   [ 5 ]
[  -0.9(1/3)    1 - 0.9(2/3)  ] [ y ] = [ 3 ]

which is the system

(11/20) x - (9/20) y = 5,   i.e.  11x - 9y = 100,
-(3/10) x + (4/10) y = 3,   i.e.  -3x + 4y = 30,

with solution x = 670/17 = W(1, u), y = 630/17 = W(2, u).

Now check if W(i, u) satisfies the dynamic programming equation for W(i):

W(i, u) = max_{a ∈ A} [ r(i, a) + 0.9 Σ_j T_a(i, j) W(j, u) ].

If so, then W(i) = W(i, u) and u is optimal. For i = 1 we get

W(1, u) = max_{a ∈ A} [ r(1, a) + 0.9 Σ_j T_a(1, j) W(j, u) ]

iff

670/17 = max{ 5 + 0.9[ (1/2)(670/17) + (1/2)(630/17) ],  4 + 0.9[ (1/4)(670/17) + (3/4)(630/17) ] } = max{39.41, 37.88} = 39.41 = 670/17,

which is true. Likewise for i = 2. So u is optimal, W(1) = 670/17 and W(2) = 630/17.
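The arithmetic in this example can be reproduced numerically. A short sketch, assuming the data as reconstructed above (r(1,c) = 5, r(2,c) = 2, r(1,d) = 4, r(2,d) = 3 and the matrices T_c, T_d); variable names are illustrative:

import numpy as np

alpha = 0.9
T = {'c': np.array([[1/2, 1/2], [2/3, 1/3]]),
     'd': np.array([[1/4, 3/4], [1/3, 2/3]])}
r = {'c': np.array([5.0, 2.0]),
     'd': np.array([4.0, 3.0])}

# Two steps of the approximation, starting from w^0 = (0, 0):
w = np.zeros(2)
for n in range(2):
    q = {a: r[a] + alpha * T[a] @ w for a in 'cd'}
    w = np.maximum(q['c'], q['d'])
    print(n + 1, w)            # w^1 = (5, 3), w^2 = (8.6, 6.3)

# Exact value of the policy u = (c, d) via (I - alpha*T_u) X = r_u:
T_u = np.array([T['c'][0], T['d'][1]])
r_u = np.array([r['c'][0], r['d'][1]])
X = np.linalg.solve(np.eye(2) - alpha * T_u, r_u)
print(X, 670/17, 630/17)       # both approx (39.41, 37.06)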

Suppose you have invested in a stock. When should you sell it? To make money, you must sell it for more than the amount originally paid. When the stock is selling at a high price, should you sell or wait for a higher price? For the unpredictable stock market this problem is intractable. We solve it for Markov chains with known transition probabilities.

Let S = {0, 1, 2, ...} be the set of states for a Markov chain. T(i, j) is the probability of going from i to j assuming you choose to continue. r(i) is the reward attached to state i; this reward is collected iff you stop at that state. At each state i, we choose an action a(i) ∈ {s, c}. Choose a = s if you wish to stop and collect the state's reward; thus r(i, s) = r(i). Choose a = c if you wish to continue; you skip the state's reward, thus r(i, c) = 0.

You continue from state to state according to the transition probabilities until you stop at a state and collect its reward. A solution to a stopping problem is a policy u(i) which determines, for each state i, whether you should stop (u(i) = s) and collect the reward r(i, s) = r(i), or continue (u(i) = c) and skip the reward r(i, c) = 0, hoping to do better later. Let W(i) be the expected reward if you start at i and act optimally. Clearly W(i) ≥ r(i), with W(i) = r(i) if you stop at i instead of continuing on to other states.
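With r(i, s) = r(i), r(i, c) = 0 and no discounting, the dynamic programming equation specializes to W(i) = max{ r(i), Σ_j T(i, j) W(j) }. A minimal sketch of computing these values by repeating this maximization, assuming a finite chain in which you must eventually stop; the names stopping_values, r, T are illustrative, not from the notes.

import numpy as np

def stopping_values(r, T, n_iter=1000):
    # Stopping problem as an MDP with r(i, s) = r(i), r(i, c) = 0, no discount:
    #     W(i) = max{ r(i), sum_j T(i, j) W(j) }.
    # Starting from W = r, each sweep can only raise values; when you must stop
    # eventually (no infinite loops), the iteration settles at the optimal values.
    r = np.asarray(r, dtype=float)
    T = np.asarray(T, dtype=float)
    W = r.copy()                   # W(i) >= r(i) always
    for _ in range(n_iter):
        W_new = np.maximum(r, T @ W)
        if np.allclose(W_new, W):
            break
        W = W_new
    stop = W <= r + 1e-12          # u(i) = s exactly where stopping attains W(i)
    return W, stop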

EXERCISE. For each state i, decide if we should stop and collect the reward, or continue. The bracketed numbers [r(i)] are the rewards. Put the correct action u(i) in the space between the parentheses ( ). At the bottom of each node, list the expected value W(i) if you start at i and act optimally. Convention: unlabeled arrows have probability 1. You must stop sometime; infinite loops are not allowed.

[Figure: a transition diagram of states with bracketed rewards (among them 3, 0, 0, 5, 6), empty parentheses for the actions u(i), and transition probabilities such as 1/3, 1/2, 1/4, 3/4 on the arrows.]

[Figure: a second exercise diagram of the same kind, with bracketed rewards (among them 6, 6, 5, 4, 0, 0, 0) and transition probabilities of 1/2 on several arrows.]

RULES FOR INITIAL KNOWN VALUES. What should you do in each case?
- If r(i) is the maximum reward.
- If r(i) is greater than all rewards reachable from i.
- If i is a dead-end state (you can't leave).
- If i is in an irreducible class.
- If all next states j have known values W(j).
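As a small illustration of these rules, here is a hypothetical three-state chain (not one of the exercise diagrams above): state 1 carries the maximum reward, state 2 is a dead end, and state 0 can be filled in once its successors are known. The data r, T below are invented for illustration only.

import numpy as np

r = np.array([1.0, 5.0, 0.0])            # rewards r(0), r(1), r(2)
T = np.array([[0.0, 1.0, 0.0],           # from state 0 you always move to state 1
              [0.5, 0.0, 0.5],           # from state 1 you move to 0 or 2
              [0.0, 0.0, 1.0]])          # state 2 is a dead end

W = r.copy()                             # W(i) >= r(i) always
for _ in range(100):
    W = np.maximum(r, T @ W)             # W(i) = max{ r(i), sum_j T(i,j) W(j) }
print(W)                                 # [5. 5. 0.]

Here state 1 stops because r(1) = 5 is the maximum reward, state 2 stops because it is a dead end, and state 0 continues because its only successor is already known to be worth 5 > r(0).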