CSC321 Lecture 22: Q-Learning

Roger Grosse

Overview

- Second of 3 lectures on reinforcement learning.
- Last time: policy gradient (e.g. REINFORCE). Optimize a policy directly; don't represent anything about the environment.
- Today: Q-learning. Learn an action-value function that predicts future returns.
- Next time: AlphaGo, which uses both a policy network and a value network.
- This lecture is review if you've taken 411.
- This lecture has more new content than I'd intended. If there is an exam question about this lecture or the next one, it won't be a hard question.

Overview

- The agent interacts with an environment, which we treat as a black box.
- Your RL code accesses it only through an API, since it's external to the agent. I.e., you're not allowed to inspect the transition probabilities, reward distributions, etc.

Recap: Markov Decision Processes

- The environment is represented as a Markov decision process (MDP) $\mathcal{M}$.
- Markov assumption: all relevant information is encapsulated in the current state.
- Components of an MDP:
  - initial state distribution $p(s_0)$
  - transition distribution $p(s_{t+1} \mid s_t, a_t)$
  - reward function $r(s_t, a_t)$
  - policy $\pi_\theta(a_t \mid s_t)$ parameterized by $\theta$
- Assume a fully observable environment, i.e. $s_t$ can be observed directly.
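
To make the black-box API concrete, here is a minimal sketch of a toy gridworld environment that exposes only `reset` and `step`. The layout, class name, and reward of -1 per step are assumptions for illustration (the -1 reward matches the example a few slides below), not anything given in the lecture.

```python
class ToyGridworld:
    """A tiny MDP exposed only through a reset/step API.

    The agent never inspects the transition or reward distributions;
    it only observes sampled states and rewards.
    """

    def __init__(self, size=4):
        self.size = size                  # states are cells 0 .. size*size - 1
        self.goal = size * size - 1       # bottom-right corner is terminal
        self.state = None

    def reset(self):
        # Sample from the initial state distribution p(s_0); deterministic here.
        self.state = 0
        return self.state

    def step(self, action):
        # action in {0: N, 1: E, 2: S, 3: W}; transitions are deterministic in this toy case.
        row, col = divmod(self.state, self.size)
        if action == 0:   row = max(row - 1, 0)
        elif action == 1: col = min(col + 1, self.size - 1)
        elif action == 2: row = min(row + 1, self.size - 1)
        elif action == 3: col = max(col - 1, 0)
        self.state = row * self.size + col
        reward = -1.0                        # -1 per time step
        done = (self.state == self.goal)
        return self.state, reward, done
```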

Finite and Infinite Horizon

- Last time: finite horizon MDPs
  - Fixed number of steps $T$ per episode
  - Maximize expected return $R = \mathbb{E}_{p(\tau)}[r(\tau)]$
- Now: more convenient to assume an infinite horizon
  - We can't sum infinitely many rewards, so we need to discount them: $100 a year from now is worth less than $100 today
  - Discounted return: $G_t = r_t + \gamma r_{t+1} + \gamma^2 r_{t+2} + \cdots$
  - Want to choose an action to maximize expected discounted return
  - The parameter $\gamma < 1$ is called the discount factor
    - small $\gamma$ = myopic
    - large $\gamma$ = farsighted
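
As a quick check on the definition of $G_t$, here is a small helper (my own, not from the lecture) that computes the discounted return of a finite reward sequence; with $\gamma < 1$, the tail of the infinite sum contributes less and less.

```python
def discounted_return(rewards, gamma=0.99):
    """Compute G_0 = r_0 + gamma*r_1 + gamma^2*r_2 + ... for a finite list of rewards."""
    g = 0.0
    for r in reversed(rewards):      # accumulate backwards: G_t = r_t + gamma * G_{t+1}
        g = r + gamma * g
    return g

# Example: three rewards of 1 with gamma = 0.9 give 1 + 0.9 + 0.81 = 2.71
print(discounted_return([1.0, 1.0, 1.0], gamma=0.9))
```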

Value Function

- Value function $V^\pi(s)$ of a state $s$ under policy $\pi$: the expected discounted return if we start in $s$ and follow $\pi$:
  $$V^\pi(s) = \mathbb{E}[G_t \mid s_t = s] = \mathbb{E}\left[\sum_{i=0}^{\infty} \gamma^i r_{t+i} \,\Big|\, s_t = s\right]$$
- Computing the value function is generally impractical, but we can try to approximate (learn) it.
- The benefit is credit assignment: we see directly how an action affects future returns, rather than waiting for rollouts.

Value Function

- Rewards: -1 per time step
- Undiscounted ($\gamma = 1$)
- Actions: N, E, S, W
- State: current location

Value Function (figure)

Action-Value Function

- Can we use a value function to choose actions?
  $$\arg\max_a \; r(s_t, a) + \gamma\, \mathbb{E}_{p(s_{t+1} \mid s_t, a)}[V^\pi(s_{t+1})]$$
- Problem: this requires taking the expectation with respect to the environment's dynamics, which we don't have direct access to!
- Instead, learn an action-value function, or Q-function: the expected return if you take action $a$ and then follow your policy,
  $$Q^\pi(s, a) = \mathbb{E}[G_t \mid s_t = s, a_t = a]$$
- Relationship: $V^\pi(s) = \sum_a \pi(a \mid s)\, Q^\pi(s, a)$
- Optimal action: $\arg\max_a Q^\pi(s, a)$

Bellman Equation

- The Bellman Equation is a recursive formula for the action-value function:
  $$Q^\pi(s, a) = r(s, a) + \gamma\, \mathbb{E}_{p(s' \mid s, a)\, \pi(a' \mid s')}[Q^\pi(s', a')]$$
- There are various Bellman equations, and most RL algorithms are based on repeatedly applying one of them.

Optimal Bellman Equation

- The optimal policy $\pi^*$ is the one that maximizes the expected discounted return, and the optimal action-value function $Q^*$ is the action-value function for $\pi^*$.
- The Optimal Bellman Equation gives a recursive formula for $Q^*$:
  $$Q^*(s, a) = r(s, a) + \gamma\, \mathbb{E}_{p(s' \mid s, a)}\left[\max_{a'} Q^*(s', a')\right]$$
- This system of equations characterizes the optimal action-value function. So maybe we can approximate $Q^*$ by trying to solve the optimal Bellman equation!
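
When the MDP is small and its transitions and rewards are actually known (which, as noted above, is not the usual RL setting), the optimal Bellman equation can be solved directly by fixed-point iteration. Here is a sketch under that assumption; the array shapes and names are mine.

```python
import numpy as np

def q_value_iteration(P, R, gamma=0.9, num_iters=1000):
    """Iterate Q(s,a) <- R(s,a) + gamma * sum_s' P(s'|s,a) * max_a' Q(s',a').

    P: array of shape (S, A, S) holding transition probabilities p(s'|s,a)
    R: array of shape (S, A) holding expected rewards r(s,a)
    """
    S, A, _ = P.shape
    Q = np.zeros((S, A))
    for _ in range(num_iters):
        # Apply the optimal Bellman operator to the current Q estimate.
        Q = R + gamma * (P @ Q.max(axis=1))   # (S, A, S) @ (S,) -> (S, A)
    return Q
```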

Q-Learning

- Let $Q$ be an action-value function which hopefully approximates $Q^*$.
- The Bellman error is the update to our expected return when we observe the next state $s'$:
  $$\underbrace{r(s_t, a_t) + \gamma \max_a Q(s_{t+1}, a)}_{\text{inside } \mathbb{E} \text{ in RHS of Bellman eqn}} - \; Q(s_t, a_t)$$
- The Bellman equation says the Bellman error is 0 in expectation.
- Q-learning is an algorithm that repeatedly adjusts $Q$ to minimize the Bellman error.
- Each time we sample consecutive states and actions $(s_t, a_t, s_{t+1})$:
  $$Q(s_t, a_t) \leftarrow Q(s_t, a_t) + \alpha \underbrace{\left[ r(s_t, a_t) + \gamma \max_a Q(s_{t+1}, a) - Q(s_t, a_t) \right]}_{\text{Bellman error}}$$
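
In the tabular case, this update is essentially a one-liner; a minimal sketch, assuming $Q$ is stored as a NumPy array of shape (num_states, num_actions):

```python
import numpy as np

def q_learning_update(Q, s, a, r, s_next, alpha=0.1, gamma=0.99):
    """One Q-learning step on the observed transition (s, a, r, s_next)."""
    bellman_error = r + gamma * np.max(Q[s_next]) - Q[s, a]
    Q[s, a] += alpha * bellman_error     # move Q(s,a) toward the bootstrapped target
    return Q
```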

Exploration-Exploitation Tradeoff

- Notice: Q-learning only learns about the states and actions it visits.
- Exploration-exploitation tradeoff: the agent should sometimes pick suboptimal actions in order to visit new states and actions.
- Simple solution: $\epsilon$-greedy policy
  - With probability $1 - \epsilon$, choose the optimal action according to $Q$
  - With probability $\epsilon$, choose a random action
- Believe it or not, $\epsilon$-greedy is still used today!
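
An $\epsilon$-greedy policy over a tabular $Q$ takes only a few lines; this sketch assumes the same array layout as the update above.

```python
import numpy as np

def epsilon_greedy(Q, s, epsilon=0.1, rng=np.random):
    """With probability epsilon pick a random action, otherwise the greedy one."""
    num_actions = Q.shape[1]
    if rng.rand() < epsilon:
        return rng.randint(num_actions)    # explore: uniformly random action
    return int(np.argmax(Q[s]))            # exploit: greedy w.r.t. the current Q
```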

Exploration-Exploitation Tradeoff

- You can't use an $\epsilon$-greedy strategy with policy gradient, because it's an on-policy algorithm: the agent can only learn about the policy it's actually following.
- Q-learning is an off-policy algorithm: the agent can learn $Q$ regardless of whether it's actually following the optimal policy.
- Hence, Q-learning is typically done with an $\epsilon$-greedy policy, or some other policy that encourages exploration.

Q-Learning (figure)

Function Approximation

- So far, we've been assuming a tabular representation of $Q$: one entry for every state/action pair. This is impractical to store for all but the simplest problems, and doesn't share structure between related states.
- Solution: approximate $Q$ using a parameterized function, e.g.
  - linear function approximation: $Q(s, a) = \mathbf{w}^\top \psi(s, a)$
  - compute $Q$ with a neural net
- Update $Q$ using backprop:
  $$t \leftarrow r(s_t, a_t) + \gamma \max_a Q(s_{t+1}, a)$$
  $$\theta \leftarrow \theta + \alpha\, (t - Q(s, a))\, \frac{\partial Q}{\partial \theta}$$
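
For the linear case $Q(s, a) = \mathbf{w}^\top \psi(s, a)$, the gradient $\partial Q / \partial \mathbf{w}$ is just $\psi(s, a)$, so the update above becomes a simple semi-gradient step. A sketch, where the feature function `psi` and the action set are assumed inputs:

```python
import numpy as np

def linear_q_update(w, psi, s, a, r, s_next, actions, alpha=0.01, gamma=0.99):
    """Semi-gradient Q-learning update for Q(s,a) = w^T psi(s,a).

    psi(s, a) returns a feature vector; actions is the set of possible actions.
    """
    q_sa = w @ psi(s, a)
    target = r + gamma * max(w @ psi(s_next, b) for b in actions)
    w += alpha * (target - q_sa) * psi(s, a)   # dQ/dw = psi(s,a) for a linear model
    return w
```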

Function Approximation

- Approximating $Q$ with a neural net is a decades-old idea, but DeepMind got it to work really well on Atari games in 2013 ("deep Q-learning").
- They used a very small network by today's standards.
- Main technical innovation: store experience in a replay buffer, and perform Q-learning using the stored experience.
- This gains sample efficiency by separating environment interaction from optimization: you don't need new experience for every SGD update!
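
The replay buffer itself is just a bounded store of transitions from which minibatches are sampled; a minimal sketch (the capacity and batch size are arbitrary choices here):

```python
import random
from collections import deque

class ReplayBuffer:
    """Stores transitions (s, a, r, s_next, done) and samples random minibatches."""

    def __init__(self, capacity=100_000):
        self.buffer = deque(maxlen=capacity)   # oldest transitions are dropped automatically

    def add(self, s, a, r, s_next, done):
        self.buffer.append((s, a, r, s_next, done))

    def sample(self, batch_size=32):
        batch = random.sample(self.buffer, batch_size)
        # unzip into tuples of states, actions, rewards, next states, and done flags
        return tuple(zip(*batch))
```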

Atari

- Mnih et al., Nature 2015. "Human-level control through deep reinforcement learning"
- Network was given raw pixels as observations.
- Same architecture shared between all games.
- Assume a fully observable environment, even though that's not the case.
- After about a day of training on a particular game, often beat human-level performance (number of points within 5 minutes of play).
- Did very well on reactive games, poorly on ones that require planning (e.g. Montezuma's Revenge).
- https://www.youtube.com/watch?v=v1eynij0rnk
- https://www.youtube.com/watch?v=4mlzncshy1q

Wireheading

- If rats have a lever that causes an electrode to stimulate certain reward centers in their brain, they'll keep pressing the lever at the expense of sleep, food, etc.
- RL algorithms show this "wireheading" behavior if the reward function isn't designed carefully.
- https://blog.openai.com/faulty-reward-functions/

Policy Gradient vs. Q-Learning

- Policy gradient and Q-learning use two very different choices of representation: policies and value functions.
- Advantage of both methods: don't need to model the environment.
- Pros/cons of policy gradient:
  - Pro: unbiased estimate of the gradient of expected return
  - Pro: can handle a large space of actions (since you only need to sample one)
  - Con: high-variance updates (implies poor sample efficiency)
  - Con: doesn't do credit assignment
- Pros/cons of Q-learning:
  - Pro: lower-variance updates, more sample efficient
  - Pro: does credit assignment
  - Con: biased updates, since the Q function is approximate (it "drinks its own Kool-Aid")
  - Con: hard to handle many actions (since you need to take the max)

Actor-Critic (optional)

- Actor-critic methods combine the best of both worlds:
  - Fit both a policy network (the "actor") and a value network (the "critic").
  - Repeatedly update the value network to estimate $V^\pi$.
  - Unroll for only a few steps, then compute the REINFORCE policy update using the expected returns estimated by the value network (see the sketch below).
  - The two networks adapt to each other, much like GAN training.
- Modern version: Asynchronous Advantage Actor-Critic (A3C)
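
To make the "unroll a few steps, then bootstrap with the critic" idea concrete, here is a sketch (my own, not from the lecture) of computing n-step targets and advantages from a short rollout, with the critic's value estimates passed in as plain numbers:

```python
def nstep_targets_and_advantages(rewards, values, bootstrap_value, gamma=0.99):
    """Compute bootstrapped return targets and advantages for a short rollout.

    rewards: rewards r_t collected over the unrolled steps
    values: critic estimates V(s_t) for those same steps
    bootstrap_value: critic estimate V(s_T) for the state after the last step
    """
    targets, g = [], bootstrap_value
    for r in reversed(rewards):
        g = r + gamma * g          # G_t = r_t + gamma * G_{t+1}, seeded with V(s_T)
        targets.append(g)
    targets.reverse()
    advantages = [t - v for t, v in zip(targets, values)]
    # The actor gets a REINFORCE update weighted by `advantages`;
    # the critic is regressed toward `targets`.
    return targets, advantages
```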