Introduction to Reinforcement Learning. CMPT 882 Mar. 18


Outline for the week
- Basic ideas in RL
- Value functions and value iteration
- Policy evaluation and policy improvement
- Model-free RL
  - Monte-Carlo and temporal-difference policy evaluation
  - ε-greedy policy improvement
- Function approximation
  - Adaptation of supervised learning to reinforcement learning
- Policy gradient

Reinforcement Learning
Humans can learn without imitation:
- Given a goal/task
- Try an initial strategy
- See how well the task is performed
- Adjust the strategy next time
A reinforcement learning agent:
- Is given the goal/task in the form of a reward function r(s, a)
- Starts with an initial policy π_θ(a|s) and executes it
- Obtains the sum of rewards Σ_t r(s_t, a_t)
- Improves the policy by updating θ, based on the rewards

Reinforcement Learning Objective
Given: an MDP with state space S, action space A, transition probabilities T, and reward function r(s, a)
Objective: maximize the expected discounted sum of rewards (the "return"):
  maximize over π_θ:  E[ Σ_{t=0}^∞ γ^t r(s_t, a_t) ]
γ ∈ [0, 1] is the discount factor:
- A larger γ roughly means more far-sighted; a smaller γ prioritizes immediate rewards
- γ < 1 avoids infinite returns; γ = 1 is possible if all sequences are finite
Constraints are now incorporated into the reward function. The only (usually implicit) constraint is the transition matrix T (the system dynamics).
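
As a small, hypothetical illustration (not from the slides), the discounted return of a finite reward sequence can be computed directly; the reward values below are made up:

```python
# Minimal sketch: discounted return for a finite reward sequence.
def discounted_return(rewards, gamma):
    """Compute sum_t gamma^t * r_t for a finite trajectory."""
    return sum(gamma**t * r for t, r in enumerate(rewards))

# Made-up rewards: a delayed reward counts for less as gamma shrinks.
rewards = [1.0, 0.0, 0.0, 10.0]
print(discounted_return(rewards, gamma=0.9))   # 1 + 0.9**3 * 10 = 8.29
print(discounted_return(rewards, gamma=0.5))   # 1 + 0.5**3 * 10 = 2.25
```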

RL vs. Other ML Paradigms
- No supervisor (but we will often draw inspiration from supervised learning)
- Sequential data in time
- Reward feedback is obtained after a long time; many actions combined together receive the reward
- Actions are dependent on each other
- In robotics: lack of data

Reinforcement Learning Categories
- Model-based: explicitly involves an MDP model
- Model-free: does not explicitly involve an MDP model
- Value-based: learns a value function, and derives the policy from the value function
- Policy-based: learns a policy without a value function
- Actor-critic: incorporates both a value function and a policy

Value Functions
State-value function V^π(s): the expected return starting from state s and following policy π
  V^π(s) = E_{a_t ~ π}[ Σ_{t=0}^∞ γ^t r(s_t, a_t) | s_0 = s ]
The expectation is over the random sequence s_0, a_0, s_1, a_1, …
Action-value function (Q function) Q^π(s, a): the expected return starting from state s, taking action a, and then following policy π
  Q^π(s, a) = E_{a_t ~ π}[ Σ_{t=0}^∞ γ^t r(s_t, a_t) | s_0 = s, a_0 = a ]
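
A hypothetical worked example (not from the slides): for a fixed policy π, V^π satisfies the linear system V^π = r^π + γ P^π V^π, so on a small MDP it can be computed exactly. The 2-state numbers below are illustrative assumptions:

```python
# Minimal sketch: exact evaluation of V^pi on a made-up 2-state MDP.
# P[s, s'] and r[s] are the transition probabilities and expected rewards
# induced by the fixed policy pi; all numbers are illustrative.
import numpy as np

gamma = 0.9
P = np.array([[0.8, 0.2],    # from state 0 under pi
              [0.1, 0.9]])   # from state 1 under pi
r = np.array([1.0, 0.0])     # expected reward in each state under pi

V = np.linalg.solve(np.eye(2) - gamma * P, r)  # solves V = r + gamma * P V
print(V)                                       # V^pi for the two states
```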

Principle of Optimality Applied to RL
Optimal discounted sum of rewards:
  V*(s) = max_π E[ Σ_{t=0}^∞ γ^t r(s_t, a_t) | s_0 = s ]
Dynamic programming:
  V*(s) = max_{a_t} E[ r(s_t, a_t) + γ V*(s_{t+1}) | s_t = s ]
  Q*(s, a) = E[ r(s_t, a_t) + γ V*(s_{t+1}) | s_t = s, a_t = a ]
Actually, the recurrence is true even without the maximization:
  V^π(s) = E[ r(s_t, a_t) + γ V^π(s_{t+1}) | s_t = s ]
  Q^π(s, a) = E[ r(s_t, a_t) + γ V^π(s_{t+1}) | s_t = s, a_t = a ]
[Figure: tree diagram with branch costs J̃_{ab_1}, J̃_{ab_2}, J̃_{ab_3} and costs-to-go J_{b_1 d}, J_{b_2 d}, J_{b_3 d}, illustrating the principle of optimality]

Basic Properties of Value Functions
  V*(s) = max_π V^π(s)
  Q*(s, a) = max_π Q^π(s, a)
  V*(s) = max_a Q*(s, a)
For now, value functions are stored in multi-dimensional arrays.
DP leads to deterministic policies; we will come back to stochastic policies.

Optimizing the RL Objective via DP
State-value function:
  V*(s) = max_{a_t} E[ r(s_t, a_t) + γ V*(s_{t+1}) | s_t = s ]
  V*(s) = max_a [ r(s, a) + γ E[ V*(s_{t+1}) | s_t = s, a ] ]
  V*(s) = max_a [ r(s, a) + γ Σ_{s'} p(s'|s, a) V*(s') ]
Bellman backup:
  V(s) ← max_a [ r(s, a) + γ Σ_{s'} p(s'|s, a) V(s') ]
This is done for all s; iterate until convergence.
Optimal policy (deterministic):
  a = argmax_a [ r(s, a) + γ Σ_{s'} p(s'|s, a) V(s') ]
[Figure: same principle-of-optimality tree diagram as on the previous slide]
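
A minimal sketch of tabular value iteration built from these Bellman backups. The array layout (p of shape (S, A, S) with p[s, a, s'] the transition probability, r of shape (S, A)) is my own assumption for illustration, not something fixed by the slides:

```python
# Minimal sketch: tabular value iteration with Bellman backups.
import numpy as np

def value_iteration(p, r, gamma=0.9, tol=1e-8):
    S, A, _ = p.shape
    V = np.zeros(S)
    while True:
        # Q[s, a] = r(s, a) + gamma * sum_s' p(s'|s, a) V(s')
        Q = r + gamma * p @ V
        V_new = Q.max(axis=1)            # Bellman backup for every state
        if np.max(np.abs(V_new - V)) < tol:
            break
        V = V_new
    policy = Q.argmax(axis=1)            # greedy (deterministic) policy
    return V, policy
```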

Optimizing the RL Objective via DP
Action-value function:
  Q*(s, a) = E[ r(s_t, a_t) + γ max_{a_{t+1}} Q*(s_{t+1}, a_{t+1}) | s_t = s, a_t = a ]
  Q*(s, a) = r(s, a) + γ E[ max_{a_{t+1}} Q*(s_{t+1}, a_{t+1}) | s_t = s, a_t = a ]
  Q*(s, a) = r(s, a) + γ Σ_{s'} p(s'|s, a) V*(s')
Bellman backup:
  V(s) ← max_a Q(s, a)
  Q(s, a) ← r(s, a) + γ Σ_{s'} p(s'|s, a) V(s')
This is done for all s and all a; iterate until convergence.
Optimal policy (deterministic): a = argmax_a Q(s, a)
[Figure: same principle-of-optimality tree diagram as on the previous slides]
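
The same backup written on the Q table, again only a sketch using the assumed p and r arrays from the previous example:

```python
# Minimal sketch: Q-value iteration using the two-step backup above.
import numpy as np

def q_value_iteration(p, r, gamma=0.9, tol=1e-8):
    S, A, _ = p.shape
    Q = np.zeros((S, A))
    while True:
        V = Q.max(axis=1)              # V(s) <- max_a Q(s, a)
        Q_new = r + gamma * p @ V      # Q(s, a) <- r(s, a) + gamma * sum_s' p(s'|s, a) V(s')
        if np.max(np.abs(Q_new - Q)) < tol:
            break
        Q = Q_new
    return Q, Q.argmax(axis=1)         # Q and the greedy policy
```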

Approximate Dynamic Programming
Use a function approximator (e.g. a neural network) V(s; w), where w are weights, to approximate V.
V(s) is no longer stored at every state; the weights w are updated using Bellman backups.
Basic algorithm (we will learn about other variants too):
- Sample some states s_i
- For each s_i, generate a target V̂(s_i) = max_a [ r(s_i, a) + γ Σ_{s'} p(s'|s_i, a) V(s'; w) ]
- Using the pairs (s_i, V̂(s_i)), update the weights w via regression (supervised learning)
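
A minimal sketch of one such iteration (fitted value iteration). The feature map, the least-squares regression step, and the assumption that a tabular model p, r is still available for computing Bellman targets are all illustrative choices, not prescriptions from the slides; w is a three-element weight vector matching the hand-picked features:

```python
# Minimal sketch: one iteration of approximate DP (fitted value iteration).
import numpy as np

def features(s, S):
    # Simple polynomial features of the normalized state index (assumption).
    x = s / (S - 1)
    return np.array([1.0, x, x**2])

def fitted_vi_step(p, r, w, gamma=0.9, n_samples=20, rng=None):
    rng = rng or np.random.default_rng(0)
    S, A, _ = p.shape
    states = rng.integers(0, S, size=n_samples)            # sample some states s_i
    Phi_all = np.stack([features(s, S) for s in range(S)])
    V_w = Phi_all @ w                                       # current approximation V(s; w)
    targets = np.array([
        np.max(r[s] + gamma * p[s] @ V_w) for s in states   # Bellman target for each s_i
    ])
    Phi = np.stack([features(s, S) for s in states])
    w_new, *_ = np.linalg.lstsq(Phi, targets, rcond=None)   # regression on (s_i, target)
    return w_new
```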

Generalized Policy Evaluation and Policy Improvement
Start with an initial policy π and value function V or Q.
Use policy π to update V (DP):
  a = π(s)
  V(s) ← r(s, a) + γ Σ_{s'} p(s'|s, a) V(s')
  Q(s, a) ← r(s, a) + γ Σ_{s'} p(s'|s, a) V(s')
  In general: any policy evaluation algorithm
Use V or Q to update policy π (DP):
  Given V(s): π(s) = argmax_a [ r(s, a) + γ Σ_{s'} p(s'|s, a) V(s') ]
  Given Q(s, a): π(s) = argmax_a Q(s, a)
  In general: any policy improvement algorithm
[Diagram: cycle between π and V/Q, alternating a policy evaluation algorithm and a policy improvement algorithm]
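
A minimal sketch of this evaluation/improvement loop as tabular policy iteration, reusing the assumed p (S, A, S) and r (S, A) arrays from the earlier sketches; exact linear-system evaluation and greedy improvement are just one instance of the general scheme:

```python
# Minimal sketch: policy iteration = alternating exact policy evaluation
# and greedy policy improvement on a tabular MDP.
import numpy as np

def policy_iteration(p, r, gamma=0.9):
    S, A, _ = p.shape
    policy = np.zeros(S, dtype=int)
    while True:
        # Policy evaluation: solve V = r_pi + gamma * P_pi V exactly.
        P_pi = p[np.arange(S), policy]             # (S, S) transitions under pi
        r_pi = r[np.arange(S), policy]             # (S,) rewards under pi
        V = np.linalg.solve(np.eye(S) - gamma * P_pi, r_pi)
        # Policy improvement: act greedily with respect to V.
        new_policy = (r + gamma * p @ V).argmax(axis=1)
        if np.array_equal(new_policy, policy):
            return V, policy
        policy = new_policy
```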

Convergence
At convergence, the following are simultaneously satisfied:
  V(s) = r(s, a) + γ Σ_{s'} p(s'|s, a) V(s'), with a = π(s)
  π(s) = argmax_a [ r(s, a) + γ Σ_{s'} p(s'|s, a) V(s') ]
This is the principle of optimality; therefore, the value function and policy are optimal.
[Diagram: the same π / V-Q cycle of policy evaluation and policy improvement]

Terminology
Value iteration: the process of iteratively updating the value function.
  With DP, we only need to keep track of the value function V or Q; the policy π is implicitly determined from the value function.
Policy iteration: the process of iteratively updating the policy.
  This is done implicitly with Bellman backups.
Greedy policy: the policy obtained by choosing the best action based on the current value function.
  If the value function is optimal, the greedy policy is optimal.

Towards Model-Free Learning
Policy evaluation:
- Monte-Carlo (MC) sampling
- Temporal-difference (TD) learning
Policy improvement:
- ε-greedy policies

Monte-Carlo Policy Evaluation
Start with an initial policy π and value function V or Q.
Use policy π to update V: a = π(s)
- Apply π to obtain a trajectory s_0, a_0, s_1, a_1, …
- Compute the return: R = Σ_t γ^t r(s_t, a_t)
- Repeat for many episodes to obtain an empirical mean
- Episode: a single try that produces a single trajectory
Use V or Q to update policy π.
[Diagram: the same π / V-Q cycle of policy evaluation and policy improvement]

Monte-Carlo Policy Evaluation
To obtain the empirical mean, we record N(s), the number of times s is visited, for every state.
- Start with N(s) = 0 for all s
- Note that this means storing N (and S below) at every state
First-visit MC policy evaluation:
- At the first time t that s is visited in an episode:
  - Increment N(s) ← N(s) + 1
  - Record the return: S(s) ← S(s) + R_t, where R_t = Σ_{k≥0} γ^k r(s_{t+k}, a_{t+k})
- Repeat for many episodes
- Estimate the value: V(s) = S(s) / N(s)

Monte-Carlo Policy Evaluation
To obtain the empirical mean, we record N(s), the number of times s is visited, for every state.
- Start with N(s) = 0 for all s
- Note that this means storing N (and S below) at every state
Every-visit MC policy evaluation:
- Every time t that s is visited in an episode:
  - Increment N(s) ← N(s) + 1
  - Record the return: S(s) ← S(s) + R_t, where R_t = Σ_{k≥0} γ^k r(s_{t+k}, a_{t+k})
- Repeat for many episodes
- Estimate the value: V(s) ≈ S(s) / N(s)
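
A minimal sketch of the first-visit variant under these definitions; the episode format (a list of (state, reward) pairs produced by running the fixed policy) is an assumption for illustration:

```python
# Minimal sketch: first-visit Monte-Carlo policy evaluation.
# Assumption: each episode is a list of (state, reward) pairs generated
# by running the fixed policy pi in the environment.
from collections import defaultdict

def first_visit_mc(episodes, gamma=0.9):
    N = defaultdict(int)      # visit counts N(s)
    S = defaultdict(float)    # accumulated returns S(s)
    for episode in episodes:
        # Return from each time step t: R_t = sum_k gamma^k r_{t+k}.
        G, returns = 0.0, [0.0] * len(episode)
        for t in reversed(range(len(episode))):
            G = episode[t][1] + gamma * G
            returns[t] = G
        seen = set()
        for t, (s, _) in enumerate(episode):
            if s in seen:
                continue          # first visit only: count s once per episode
            seen.add(s)
            N[s] += 1
            S[s] += returns[t]
    return {s: S[s] / N[s] for s in N}   # V(s) = S(s) / N(s)
```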

Incremental Updates
Instead of estimating V^π(s) after many episodes, we can update it incrementally after every episode, upon receiving the return R:
  N(s) ← N(s) + 1
  V(s) ← V(s) + (1 / N(s)) (R − V(s))
More generally, we can weight the second term differently:
  V(s) ← V(s) + α (R − V(s))
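
A small sketch of the same estimate in incremental form; the lazily initialized dictionaries are an implementation choice, not something specified on the slide:

```python
# Minimal sketch: incremental Monte-Carlo update after one episode's return R.
from collections import defaultdict

N = defaultdict(int)
V = defaultdict(float)

def mc_update(s, R, alpha=None):
    """Running-mean update if alpha is None, else a constant step size."""
    N[s] += 1
    step = (1.0 / N[s]) if alpha is None else alpha
    V[s] += step * (R - V[s])
```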

Monte-Carlo Policy Evaluation
Start with an initial policy π and value function V or Q.
Use policy π to update V: a = π(s)
- MC policy evaluation provides an estimate of V^π
- Many episodes are needed to obtain an accurate estimate
- Model-free with MC!
Use V or Q to update policy π: greedy policy?
[Diagram: the same π / V-Q cycle of policy evaluation and policy improvement]

Monte-Carlo Policy Evaluation
Start with an initial policy π and value function V or Q.
Use policy π to update V: a = π(s)
- MC policy evaluation provides an estimate of V^π
- Many episodes are needed to obtain an accurate estimate
- Model-free with MC!
Use V or Q to update policy π: greedy policy?
- The greedy policy lacks exploration, so V^π is not estimated at many states
- Solution: ε-greedy policy
[Diagram: the same π / V-Q cycle of policy evaluation and policy improvement]

ε-greedy Policy
Also known as ε-greedy exploration:
- Choose a random action with probability ε
  - Typically uniformly at random
  - If a takes on discrete values, then all actions will eventually be chosen
- Choose the action from the greedy policy with probability 1 − ε
  a = argmax_a [ r(s, a) + γ Σ_{s'} p(s'|s, a) V(s') ]
This still requires a model, p(s'|s, a). Solution: the Q function.

Monte-Carlo Policy Evaluation
To obtain the empirical mean, we record N(s, a), the number of times the pair (s, a) is visited.
- Start with N(s, a) = 0 for all s and a
- Note that this means N (and S below) must be stored for every s and a
First-visit MC policy evaluation:
- At the first time t that (s, a) is visited in an episode:
  - Increment N(s, a) ← N(s, a) + 1
  - Record the return: S(s, a) ← S(s, a) + R_t, where R_t = Σ_{k≥0} γ^k r(s_{t+k}, a_{t+k})
- Repeat for many episodes
- Estimate the action-value function: Q(s, a) = S(s, a) / N(s, a)
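
The same first-visit procedure keyed on state-action pairs rather than states; a short sketch assuming episodes are lists of (state, action, reward) tuples:

```python
# Minimal sketch: first-visit Monte-Carlo estimation of Q(s, a).
from collections import defaultdict

def first_visit_mc_q(episodes, gamma=0.9):
    N = defaultdict(int)      # visit counts N(s, a)
    S = defaultdict(float)    # accumulated returns S(s, a)
    for episode in episodes:
        G, returns = 0.0, [0.0] * len(episode)
        for t in reversed(range(len(episode))):
            G = episode[t][2] + gamma * G   # reward is the third element
            returns[t] = G
        seen = set()
        for t, (s, a, _) in enumerate(episode):
            if (s, a) in seen:
                continue                    # first visit of (s, a) only
            seen.add((s, a))
            N[(s, a)] += 1
            S[(s, a)] += returns[t]
    return {sa: S[sa] / N[sa] for sa in N}  # Q(s, a) = S(s, a) / N(s, a)
```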

Incremental Updates
Instead of estimating Q(s, a) after many episodes, we can update it incrementally after every episode, upon receiving the return R:
  N(s, a) ← N(s, a) + 1
  Q(s, a) ← Q(s, a) + (1 / N(s, a)) (R − Q(s, a))
More generally, we can weight the second term differently:
  Q(s, a) ← Q(s, a) + α (R − Q(s, a))

Monte-Carlo Policy Evaluation
Start with an initial policy π and value function V or Q.
Use policy π to update Q: a = π(s)
- MC policy evaluation provides an estimate of Q^π
- Many episodes are needed to obtain an accurate estimate
- Model-free with MC!
Use V or Q to update policy π: greedy policy?
- The greedy policy lacks exploration, so Q^π is not estimated at many state-action pairs
- Solution: ε-greedy policy
[Diagram: the same π / V-Q cycle of policy evaluation and policy improvement]

ε-greedy Policy
Also known as ε-greedy exploration:
- Choose a random action with probability ε
  - Typically uniformly at random
  - If a takes on discrete values, then all actions will eventually be chosen
- Choose the action from the greedy policy with probability 1 − ε
  a = argmax_a Q(s, a)
Model-free!
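
Putting the pieces together, a minimal sketch of model-free Monte-Carlo control: every-visit MC estimation of Q with incremental updates, plus ε-greedy improvement. The `env` object with `reset()` and `step(a)` methods returning (next_state, reward, done), and discrete states and actions, is an assumed gym-style interface, not something defined on the slides:

```python
# Minimal sketch: model-free Monte-Carlo control with an epsilon-greedy policy.
# Assumption: env.reset() -> state, env.step(a) -> (next_state, reward, done).
import random
from collections import defaultdict

def mc_control(env, n_actions, episodes=1000, gamma=0.9, alpha=0.1, eps=0.1):
    Q = defaultdict(lambda: [0.0] * n_actions)

    def epsilon_greedy(s):
        if random.random() < eps:
            return random.randrange(n_actions)                 # explore
        return max(range(n_actions), key=lambda a: Q[s][a])    # exploit, model-free

    for _ in range(episodes):
        # Roll out one episode with the current epsilon-greedy policy.
        trajectory, s, done = [], env.reset(), False
        while not done:
            a = epsilon_greedy(s)
            s_next, r, done = env.step(a)
            trajectory.append((s, a, r))
            s = s_next
        # Every-visit MC: incremental update toward the return from each step.
        G = 0.0
        for s, a, r in reversed(trajectory):
            G = r + gamma * G
            Q[s][a] += alpha * (G - Q[s][a])
    return Q
```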