CSE 250A. Assignment 7. Out: Tue Nov. Due: Thu Dec. Reading: Sutton & Barto, Chapters -.

7.1 Policy improvement

Consider the Markov decision process (MDP) with two states s ∈ {0, 1}, two actions a ∈ {0, 1}, a discount factor γ, and rewards and transition matrices as shown below:

    [Table: rewards R(s) for s = 0, 1]
    [Table: transition probabilities P(s'|s, a=0)]
    [Table: transition probabilities P(s'|s, a=1)]

(a) Consider the policy π that chooses the action a = 1 in each state. For this policy, solve the linear system of Bellman equations to compute the state-value function V^π(s) for s ∈ {0, 1}. Your answers should complete the following table.

    s    π(s)    V^π(s)
    0    1
    1    1

(b) Compute the greedy policy π'(s) with respect to the state-value function V^π(s) from part (a). Your answers should complete the following table.

    s    π(s)    π'(s)
    0    1
    1    1
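No particular tool or language is prescribed here, but the two-state linear system is easy to sanity-check numerically. The sketch below is a minimal NumPy version of parts (a) and (b); the discount factor, rewards, and transition matrices in it are placeholders rather than the assignment's values, so only the structure of the computation carries over once the entries from the tables above are substituted in.

```python
import numpy as np

# Placeholder MDP: gamma, R, and P below are NOT the assignment's values;
# substitute the entries from the tables above before trusting the numbers.
gamma = 0.5                       # placeholder discount factor
R = np.array([-1.0, 1.0])         # placeholder rewards R(s) for s in {0, 1}
P = {                             # P[a][s, s'] = P(s' | s, a); each row sums to one
    0: np.array([[1.0, 0.0],
                 [0.0, 1.0]]),
    1: np.array([[0.0, 1.0],
                 [1.0, 0.0]]),
}

# (a) Evaluate the fixed policy pi(s) = 1 by solving the Bellman linear system
#     (I - gamma * P_pi) V = R, where P_pi is the transition matrix under pi.
P_pi = P[1]
V_pi = np.linalg.solve(np.eye(2) - gamma * P_pi, R)
print("V^pi(s) for s = 0, 1:", V_pi)

# (b) Greedy policy: pi'(s) = argmax_a [ R(s) + gamma * sum_s' P(s'|s,a) V^pi(s') ].
Q = np.column_stack([R + gamma * P[a] @ V_pi for a in (0, 1)])  # shape (2 states, 2 actions)
print("greedy policy pi'(s):", Q.argmax(axis=1))
```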

7.2 Value and policy iteration

In this problem, you will use value and policy iteration to find the optimal policy of the MDP demonstrated in class. This MDP has S = 8 states, A = 4 actions, and discount factor γ = 0.99. Download the ASCII files on the course web site that store the transition matrices and reward function for this MDP. The transition matrices are stored in a sparse format, listing only the row and column indices with non-zero values; if loaded correctly, the rows of these matrices should sum to one. Turn in your source code along with your answers to the following questions.

(a) Compute the optimal state-value function V*(s) using the method of value iteration. Print out a list of the non-zero values of V*(s). Compare your answer to the numbered maze shown below. The correct value function will have positive values at all the numbered squares and negative values at all the squares with dragons.

(b) Compute the optimal policy π*(s) from your answer in part (a). Interpret the four actions in this MDP as moves to the WEST, NORTH, EAST, and SOUTH. Fill in the correspondingly numbered squares of the maze with arrows that point in the directions prescribed by the optimal policy. Turn in a copy of your solution for the optimal policy, as visualized in this way.

(c) Compute the optimal policy π*(s) using the method of policy iteration. For the numbered squares in the maze, does it agree with your result from part (b)?
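One possible implementation outline for parts (a)-(c), assuming the ASCII files have already been parsed into a dense NumPy array P of shape (A, S, S) with P[a, s, s'] = P(s'|s, a) and a reward vector R of shape (S,). This array layout and the function names are choices of this sketch, not anything specified by the assignment, and the file-parsing step is omitted.

```python
import numpy as np

def value_iteration(P, R, gamma=0.99, tol=1e-10):
    """Repeat Bellman optimality backups until the value function stops changing."""
    V = np.zeros(P.shape[1])
    while True:
        # V(s) <- R(s) + gamma * max_a sum_s' P(s'|s,a) V(s')
        V_new = R + gamma * np.max(P @ V, axis=0)
        if np.max(np.abs(V_new - V)) < tol:
            return V_new
        V = V_new

def greedy_policy(P, R, V, gamma=0.99):
    """pi(s) = argmax_a [ R(s) + gamma * sum_s' P(s'|s,a) V(s') ]."""
    return np.argmax(R + gamma * (P @ V), axis=0)

def policy_iteration(P, R, gamma=0.99):
    """Alternate exact policy evaluation with greedy improvement until the policy is stable."""
    A, S, _ = P.shape
    pi = np.zeros(S, dtype=int)
    while True:
        P_pi = P[pi, np.arange(S), :]                       # rows P(. | s, pi(s))
        V = np.linalg.solve(np.eye(S) - gamma * P_pi, R)    # evaluate the current policy
        pi_new = greedy_policy(P, R, V, gamma)
        if np.array_equal(pi_new, pi):
            return pi, V
        pi = pi_new
```

Under these assumptions, part (a) is V = value_iteration(P, R), part (b) is greedy_policy(P, R, V), and part (c) is policy_iteration(P, R).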

7.3 Effective horizon time

Consider a Markov decision process (MDP) whose rewards r_t ∈ [0, 1] are bounded between zero and one. Let h = 1/(1−γ) define an effective horizon time in terms of the discount factor 0 ≤ γ < 1. Consider the approximation to the (infinite horizon) discounted return,

    r_0 + γ r_1 + γ^2 r_2 + γ^3 r_3 + γ^4 r_4 + ...,

obtained by neglecting rewards from some time t and beyond. Recalling that log γ ≤ γ − 1, show that the error from such an approximation decays exponentially as:

    Σ_{k≥t} γ^k r_k ≤ h e^{−t/h}.

Thus, we can view MDPs with discounted returns as similar to MDPs with finite horizons, where the finite horizon h = 1/(1−γ) grows as γ → 1. This is a useful intuition for proving the convergence of many algorithms in reinforcement learning.

7.4 Convergence of iterative policy evaluation

Consider an MDP with transition matrices P(s'|s, a) and reward function R(s). In class, we showed that the state-value function V^π(s) for a fixed policy π satisfies the Bellman equation:

    V^π(s) = R(s) + γ Σ_{s'} P(s'|s, π(s)) V^π(s').

For γ < 1, one method to solve this set of linear equations is by iteration. Initialize the state-value function by V_0(s) = 0 for all states s. The update rule at the k-th iteration is given by:

    V_{k+1}(s) ← R(s) + γ Σ_{s'} P(s'|s, π(s)) V_k(s').

Use a contraction mapping to derive an upper bound on the error Δ_k = max_s |V_k(s) − V^π(s)| after k iterations of the update rule. Your result should show that the error Δ_k decays exponentially fast in the number of iterations k, and hence that lim_{k→∞} V_k(s) = V^π(s) for all states s.
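For problem 7.4, the geometric error decay that the contraction argument predicts can be watched directly on a toy example. The small random MDP and fixed policy below are invented solely for this check (they are unrelated to the maze MDP of problem 7.2), and the sketch is optional, not something the assignment asks for.

```python
import numpy as np

rng = np.random.default_rng(0)
S, gamma = 5, 0.9
P_pi = rng.random((S, S))
P_pi /= P_pi.sum(axis=1, keepdims=True)                  # random stochastic matrix (rows sum to one)
R = rng.uniform(-1.0, 1.0, size=S)                       # random reward vector

V_exact = np.linalg.solve(np.eye(S) - gamma * P_pi, R)   # exact V^pi, for reference
V = np.zeros(S)                                          # V_0(s) = 0
delta0 = np.max(np.abs(V - V_exact))                     # Delta_0
for k in range(1, 21):
    V = R + gamma * P_pi @ V                             # V_{k+1} <- R + gamma P_pi V_k
    delta = np.max(np.abs(V - V_exact))                  # Delta_k
    print(f"k={k:2d}  Delta_k={delta:.3e}  gamma^k * Delta_0={gamma**k * delta0:.3e}")
```

The printed error Δ_k shrinks by roughly a factor of γ per iteration, which is the behaviour your bound should capture.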

7.5 Value function for a random walk

Consider a Markov decision process (MDP) with discrete states s ∈ {0, 1, 2, ...} and rewards R(s) = s that grow linearly as a function of the state. Also, consider a policy π whose action in each state either leaves the state unchanged or yields a transition to the next highest state:

    P(s'|s, π(s)) = 1/2 if s' = s,
                    1/2 if s' = s+1,
                    0 otherwise.

Intuitively, this policy can be viewed as a right-drifting random walk. As usual, the value function for this policy,

    V^π(s) = E[ Σ_{t=0}^∞ γ^t R(s_t) | s_0 = s ],

is defined as the expected sum of discounted rewards starting from state s.

(a) Assume that the value function V^π(s) satisfies a Bellman equation analogous to the one in MDPs with finite state spaces. Write down the Bellman equation satisfied by V^π(s).

(b) Show that one possible solution to the Bellman equation in part (a) is given by the linear form V^π(s) = αs + β, where α and β are coefficients that you should express in terms of the discount factor γ. (Hint: substitute this solution into both sides of the Bellman equation, and solve for α and β by requiring that both sides are equal for all values of s.)

(c) Extra credit: justify that the value function V^π(s) has this linear form. (In other words, rule out other possible solutions to the Bellman equation for this policy.)
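A numerical sanity check for problem 7.5, again optional: truncate the chain to a finite number of states, solve the resulting Bellman system exactly, and inspect the first differences of V^π far from the artificial boundary, where the truncation has a negligible effect. The truncation size and discount factor below are arbitrary choices made only for this check.

```python
import numpy as np

# Truncated version of the random walk: states 0..N-1, with the last state made
# absorbing so that the Bellman system is finite. Far from that artificial
# boundary, the truncation barely perturbs V^pi.
N, gamma = 400, 0.9              # truncation size and discount factor chosen for the check
P = np.zeros((N, N))
for s in range(N - 1):
    P[s, s] = 0.5                # stay in place (the policy described above)
    P[s, s + 1] = 0.5            # move to the next highest state
P[N - 1, N - 1] = 1.0            # absorbing boundary (truncation artifact only)
R = np.arange(N, dtype=float)    # R(s) = s

V = np.linalg.solve(np.eye(N) - gamma * P, R)

# If V^pi(s) = alpha*s + beta, the first differences V(s+1) - V(s) are constant.
diffs = np.diff(V[:50])
print("max deviation from a constant slope over states 0..49:",
      np.max(np.abs(diffs - diffs[0])))
```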

7.6 Explore or exploit

Consider a Markov decision process (MDP) with discrete states s ∈ {1, 2, ..., n}, binary actions a ∈ {0, 1}, deterministic rewards R(s), and transition probabilities:

    P(s'|s, a=0) = 1/n for all s',
    P(s'|s, a=1) = 1 if s' = s, and 0 otherwise.

In this MDP, the explore action a=0 leads to uniformly random exploration of the state space, while the exploit action a=1 leaves the agent in its current state. The simple structure of this MDP allows one to calculate its value functions, as well as the form of its optimal policy. This problem guides you through those calculations.

(a) Local reward-mining. As usual, the state-value function V^π(s) is defined as the expected sum of discounted rewards starting from state s at time t=0 and taking action π(s_t) at time t:

    V^π(s) = E_π[ Σ_{t=0}^∞ γ^t R(s_t) | s_0 = s ].

Consider the subset of states S_1 = {s | π(s) = 1} in which the agent chooses the exploit action. Compute V^π(s) for states s ∈ S_1 by evaluating the expected sum of discounted rewards.

(b) Bellman equation for exploration. Consider the subset of states S_0 = {s | π(s) = 0} in which the agent chooses the explore action. Write down the Bellman equation satisfied by the value function V^π(s) for states s ∈ S_0. Your answer should substitute in the appropriate transition probabilities from these states.

(c) State-averaged value function. Assume that the reward function, averaged over the state space, has zero mean: (1/n) Σ_s R(s) = 0. Let µ = |S_0|/n denote the fraction of states in which the agent chooses to explore. Let r = (1/n) Σ_{s∈S_1} R(s) denote the average reward in the remaining states. Let v = (1/n) Σ_s V^π(s) denote the expected return if the initial state is chosen uniformly at random. Combining your results from parts (a) and (b), show that the state-averaged expected return v is given by:

    v = γ r / ((1−γ)(1−γµ)).

(d) Value of exploration. Using the results from parts (b) and (c), express the state-value function V^π(s) for states s ∈ S_0 in terms of the reward function R(s), the discount factor γ, and the quantities r and µ.

(e) Optimal policy. The form of this MDP suggests the following intuitive strategy: in states with low rewards, opt for random exploration at the next time step, while in states with high rewards, opt to remain in place, exploiting the current reward for all future time. In this problem, you will confirm this intuition. Starting from the observation that an optimal policy is its own greedy policy, infer that the optimal policy π* in this MDP takes the simple form:

    π*(s) = 1 if V*(s) > θ,
            0 if V*(s) ≤ θ,

where the threshold θ is a constant, independent of the state s. As part of your answer, express the threshold θ in terms of the state-averaged optimal value function v* = (1/n) Σ_s V*(s).

7.7 Stochastic approximation

In this problem, you will analyze an incremental update rule for estimating the mean µ = E[X] of a random variable X from samples {x_1, x_2, x_3, ...}. Consider the incremental update rule:

    µ_k ← µ_{k−1} + α_k (x_k − µ_{k−1}),

with the initial condition µ_0 = 0 and step sizes α_k = 1/k.

(a) Show that the step sizes α_k = 1/k obey the conditions (i) Σ_{k=1}^∞ α_k = ∞ and (ii) Σ_{k=1}^∞ α_k^2 < ∞, thus guaranteeing that the update rule converges to the true mean. (You are not required to prove convergence.)

(b) Show that for this particular choice of step sizes, the incremental update rule yields the same result as the sample average: µ_k = (1/k)(x_1 + x_2 + ... + x_k).
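For problem 7.6(c), the claimed expression for the state-averaged return can be checked numerically on a randomly generated instance. The sketch below is not part of the assignment; n, γ, and the randomly chosen policy are arbitrary, and the reward vector is shifted so that its mean is zero, as the problem assumes.

```python
import numpy as np

# Random instance of the explore/exploit MDP under a fixed policy pi.
rng = np.random.default_rng(2)
n, gamma = 20, 0.9
R = rng.normal(size=n)
R -= R.mean()                                    # enforce (1/n) * sum_s R(s) = 0
pi = (rng.random(n) < 0.4).astype(int)           # pi(s) = 1 (exploit) on a random subset, else 0 (explore)

# Transition matrix under pi: identity row where pi(s)=1, uniform row where pi(s)=0.
P_pi = np.where(pi[:, None] == 1, np.eye(n), np.full((n, n), 1.0 / n))
V = np.linalg.solve(np.eye(n) - gamma * P_pi, R)

mu = np.mean(pi == 0)                            # fraction of explore states
r = np.sum(R[pi == 1]) / n                       # r as defined in part (c)
print("state-averaged return v:            ", V.mean())
print("gamma*r / ((1-gamma)*(1-gamma*mu)): ", gamma * r / ((1 - gamma) * (1 - gamma * mu)))
```

For problem 7.7(b), a short check that the incremental rule with α_k = 1/k reproduces the ordinary sample average; the Gaussian samples are stand-in data chosen only for the illustration.

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.normal(loc=2.0, scale=1.0, size=1000)   # stand-in samples of a random variable X

mu = 0.0                                        # mu_0 = 0
for k, xk in enumerate(x, start=1):
    mu += (1.0 / k) * (xk - mu)                 # mu_k = mu_{k-1} + alpha_k (x_k - mu_{k-1})

print("incremental estimate:", mu)
print("sample average:      ", x.mean())        # the two agree to machine precision
```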