This question has three parts, each of which can be answered concisely, but be prepared to explain and justify your concise answer.

1. Suppose you have a policy π and its action-value function q_π, and you greedify q_π to produce the deterministic policy π′: π′(s) = argmax_a q_π(s, a), ∀s ∈ S.
(a) What do you know about the relationship between π and π′?
(b) Now suppose you notice that π′ is the same as π. What then do you know about the two policies?
(c) Now suppose you notice that π′ is different from π. Do you know anything more about the two policies, other than what you reported in part (a)?

2. The goal of reinforcement learning can be seen as producing a ________, which maps from ________ to ________.

3. From state x, taking action 1 always produces a reward of 2 and sends you to a state y from which a return of 10 is always received. The discount parameter γ is 0.9. What is v_π(y)? What is q_*(x, 1)?

4. Suppose the discount rate is γ = 0.5 and the following sequence of rewards is observed: R_1 = 7, R_2 = 6, R_3 = -4, R_4 = 4, R_5 = 8, R_6 = 2, followed by the terminal state. What are the following returns? G_6? G_5? G_4? G_3? G_2? G_1? G_0?
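For Questions 3 and 4 the arithmetic can be checked with a few lines of code. The sketch below is not part of the exam; it simply applies the recursion G_t = R_{t+1} + γ G_{t+1} backward from the terminal state, using the rewards and discount rate given in Question 4.

    # Sketch: discounted returns G_0, ..., G_T for an episodic reward sequence.
    # Convention: rewards[i] holds R_{i+1}, and the return from the terminal
    # state is 0, so G_t = R_{t+1} + gamma * G_{t+1}.
    def returns(rewards, gamma):
        G = [0.0] * (len(rewards) + 1)          # G_T = 0 at the terminal state
        for t in reversed(range(len(rewards))):
            G[t] = rewards[t] + gamma * G[t + 1]
        return G

    # Question 4: gamma = 0.5, rewards R_1, ..., R_6 = 7, 6, -4, 4, 8, 2
    print(returns([7, 6, -4, 4, 8, 2], 0.5))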

5. Given a choice between two actions, we (should) always pick the one with the larger ________.
a) reward  b) return  c) value

6. An episodic task begins and ends. A ________ task goes on and on.
a) continuous  b) discounted  c) continuing  d) average reward

7. Suppose the discount rate γ is 0.5 and the following sequence of rewards is observed: R_1 = 1, R_2 = 6, R_3 = -12, R_4 = 16, followed by the terminal state. What are the following returns? G_4? G_3? G_2? G_1? G_0?

8. Suppose the discount rate γ is 0.5 and the following sequence of rewards is observed: R_1 = 1, followed by an infinite sequence of rewards of +13. What are the following returns? G_2? G_1? G_0?
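Question 8's infinite tail of rewards cannot be handled by the finite script above. As a reminder (standard material, not part of the exam), the two identities normally used are

    G_t = R_{t+1} + γ G_{t+1}    and    c + γc + γ²c + ... = c / (1 - γ)   for 0 ≤ γ < 1,

so a constant reward of c received on every step from time t+1 onward contributes c / (1 - γ) to G_t.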

Question 9. Give a definition of v_π in terms of q_π.
Question 10. Give a definition of q_π in terms of v_π.
Question 11. Give a definition of v_* in terms of q_*.
Question 12. Give a definition of q_* in terms of v_*.
Question 13. Give a definition of π_* in terms of q_*.
Question 14. Give a definition of π_* in terms of v_*.
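For checking answers to Questions 9 through 14, the standard relationships (in the notation of Sutton and Barto, assuming the usual four-argument dynamics function p(s′, r | s, a); stated here as a reference, not as part of the exam) are

    v_π(s) = Σ_a π(a|s) q_π(s, a)
    q_π(s, a) = Σ_{s′, r} p(s′, r | s, a) [ r + γ v_π(s′) ]
    v_*(s) = max_a q_*(s, a)
    q_*(s, a) = Σ_{s′, r} p(s′, r | s, a) [ r + γ v_*(s′) ]
    π_*(s) = argmax_a q_*(s, a) = argmax_a Σ_{s′, r} p(s′, r | s, a) [ r + γ v_*(s′) ]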

Question 15. Sketch the backup diagrams for the following tabular learning methods:
(a) TD(0)
(b) one-step Expected Sarsa
(c) single-step full backup of v
(d) 2-step Tree backup
(e) Monte Carlo backup for v

Question 16. Write the update that corresponds to each of the following backup diagrams, (a) through (h). [The backup diagrams themselves are not reproduced in this transcription.]

Question 17. For a finite continuing discounted MDP with discount factor γ, suppose you know two numbers r_min and r_max such that for all r ∈ R, r_min ≤ r ≤ r_max. Give expressions for two numbers v_min and v_max such that v_min ≤ v_π(s) ≤ v_max for all states s ∈ S and all policies π.
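One candidate answer, obtained by bounding each term of the discounted return (a standard argument, assuming 0 ≤ γ < 1), is

    v_π(s) = E[ Σ_{k=0}^∞ γ^k R_{t+k+1} ] ≤ Σ_{k=0}^∞ γ^k r_max = r_max / (1 - γ),

and symmetrically v_π(s) ≥ r_min / (1 - γ), so v_min = r_min / (1 - γ) and v_max = r_max / (1 - γ) suffice.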

Question 18. Markov Decision Processes. Consider the MDP in the figure below. There are two states, S1 and S2, and two actions, switch and stay. The switch action takes the agent to the other state with probability 0.8 and leaves it in the same state with probability 0.2. The stay action keeps the agent in the same state with probability 1. The reward for action stay in state S2 is +1. All other rewards are 0. The discount factor is γ = 1/2. [The transition diagram is not reproduced in this transcription.]
(a) What is the optimal policy?
(b) Compute the optimal value function by solving the linear system of equations corresponding to the optimal policy.
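The sketch below (not part of the exam) shows one way to encode this MDP in code and evaluate any fixed deterministic policy by solving the Bellman linear system v = r_π + γ P_π v, which is the kind of computation part (b) asks for. The state and action names, and the example policy passed at the end, are my own choices for illustration; the code makes no claim about which policy is optimal.

    import numpy as np

    # Sketch: the two-state MDP of Question 18.
    # P[s][a] is a list of (probability, next_state, reward) triples.
    GAMMA = 0.5
    STATES = ["S1", "S2"]
    P = {
        "S1": {"stay":   [(1.0, "S1", 0.0)],
               "switch": [(0.8, "S2", 0.0), (0.2, "S1", 0.0)]},
        "S2": {"stay":   [(1.0, "S2", 1.0)],
               "switch": [(0.8, "S1", 0.0), (0.2, "S2", 0.0)]},
    }

    def evaluate(policy):
        """Solve (I - gamma * P_pi) v = r_pi for a deterministic policy {state: action}."""
        n = len(STATES)
        P_pi, r_pi = np.zeros((n, n)), np.zeros(n)
        for i, s in enumerate(STATES):
            for prob, s2, r in P[s][policy[s]]:
                P_pi[i, STATES.index(s2)] += prob
                r_pi[i] += prob * r
        return np.linalg.solve(np.eye(n) - GAMMA * P_pi, r_pi)

    # Example call with one candidate policy (chosen purely for illustration):
    print(evaluate({"S1": "switch", "S2": "stay"}))   # prints [v(S1), v(S2)]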

(c) Suppose that you are doing synchronous value iteration to compute the optimal state-value function. You start with all value estimates equal to 0. Show the value estimates after 1 and 2 iterations respectively.
(d) Suppose you are doing TD learning. You start with all value estimates equal to 0, and you observe the following trajectory (sequence of states, actions, and rewards): S1, switch, 0, S2, stay, +1, S2. Assuming the learning rate α = 0.1, show the TD updates that are performed.
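For part (d), a minimal sketch of the TD(0) updates on the given trajectory (the variable names are my own); each step applies V(S) ← V(S) + α [R + γ V(S′) − V(S)].

    # Sketch: TD(0) updates for the trajectory S1, switch, 0, S2, stay, +1, S2.
    GAMMA, ALPHA = 0.5, 0.1
    V = {"S1": 0.0, "S2": 0.0}                  # all estimates start at 0

    # (state, reward, next_state) triples from the trajectory; the actions do
    # not enter the state-value TD(0) update.
    transitions = [("S1", 0.0, "S2"), ("S2", 1.0, "S2")]

    for s, r, s2 in transitions:
        td_error = r + GAMMA * V[s2] - V[s]
        V[s] += ALPHA * td_error
        print(f"V({s}) <- V({s}) + {ALPHA} * {td_error} = {V[s]}")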

Question 19. From state A, the first action leads deterministically to rewards of 2, 4, and 9 followed by a return to A, whereas the second action leads deterministically to a reward of 3 followed by an immediate return to state A. For what values of γ is the first action the better action? To solve this you may have to use the formula for solving quadratic equations. (A sketch of one way to set up the comparison follows Question 20.)

Question 20. What is generalized policy iteration? Refer to all three words of the phrase in your explanation.
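For Question 19, a hedged sketch of one way to set up the comparison, under the assumption that whichever action is chosen from A is then taken on every subsequent visit to A: always taking the first action yields the reward cycle 2, 4, 9, 2, 4, 9, ..., so

    v_1(A) = (2 + 4γ + 9γ²) / (1 - γ³),

while always taking the second action yields

    v_2(A) = 3 / (1 - γ).

Requiring v_1(A) > v_2(A) and using 1 - γ³ = (1 - γ)(1 + γ + γ²) reduces the comparison to a quadratic inequality in γ, which is where the quadratic formula mentioned in the question comes in.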