Value Function Methods. CS 294-112: Deep Reinforcement Learning. Sergey Levine

Class Notes
1. Homework 2 is due in one week
2. Remember to start forming final project groups and writing your proposal! Proposal due 9/26, in two weeks. Submission instructions coming soon.

Today's Lecture
1. What if we just use a critic, without an actor?
2. Extracting a policy from a value function
3. The Q-learning algorithm
4. Extensions: continuous actions, improvements
Goals:
- Understand how value functions give rise to policies
- Understand the Q-learning algorithm
- Understand practical considerations for Q-learning

Recap: actor-critic. [RL loop diagram: generate samples (i.e. run the policy) → fit a model to estimate return → improve the policy]

Can we omit the policy gradient completely? Forget policies, let's just do this! [RL loop diagram: generate samples (i.e. run the policy) → fit a model to estimate return → improve the policy]

Policy iteration. High-level idea: repeatedly evaluate the current policy (fit a model to estimate its return; how to do this?) and then improve the policy by acting greedily with respect to that estimate. [RL loop diagram]
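
As an illustration (not from the slides), here is a minimal tabular policy iteration sketch, assuming a known model with transition probabilities P of shape (S, A, S) and rewards R of shape (S, A); all names are hypothetical:

import numpy as np

def policy_iteration(P, R, gamma=0.99, eval_iters=100):
    # Tabular policy iteration sketch (illustrative, assumes a known model).
    # P: (S, A, S) transition probabilities; R: (S, A) rewards.
    S, A, _ = P.shape
    pi = np.zeros(S, dtype=int)                      # deterministic policy: state -> action
    while True:
        # Policy evaluation: iterate the Bellman equation for V^pi ("fit a model to estimate return")
        V = np.zeros(S)
        for _ in range(eval_iters):
            V = R[np.arange(S), pi] + gamma * (P[np.arange(S), pi] @ V)
        # Policy improvement: act greedily with respect to Q^pi ("improve the policy")
        Q = R + gamma * np.einsum('sap,p->sa', P, V)
        new_pi = Q.argmax(axis=1)
        if np.array_equal(new_pi, pi):
            return pi, V
        pi = new_pi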

Dynamic programming. [Figure: table of value estimates for a small gridworld example.] Just use the current estimate here.

Policy iteration with dynamic programming. [RL loop diagram; figure: table of value estimates for the gridworld example.]

Even simpler dynamic programming: value iteration. Skip the explicit policy and directly back up the max over actions; this approximates the new value! [RL loop diagram]
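
To make the update concrete, a minimal tabular value iteration sketch under the same illustrative known-model assumptions as above:

import numpy as np

def value_iteration(P, R, gamma=0.99, iters=1000):
    # Tabular value iteration sketch: V(s) <- max_a [ r(s,a) + gamma * E_{s'}[V(s')] ].
    # P: (S, A, S) transition probabilities; R: (S, A) rewards.
    S, A, _ = P.shape
    V = np.zeros(S)
    for _ in range(iters):
        Q = R + gamma * np.einsum('sap,p->sa', P, V)   # backup using the current estimate
        V = Q.max(axis=1)                              # the max over actions replaces the explicit policy
    pi = Q.argmax(axis=1)                              # a greedy policy can be read off at the end
    return V, pi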

Fitted value iteration. A table over all states hits the curse of dimensionality, so instead fit a function approximator to the backed-up values. [RL loop diagram]
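
A hedged sketch of a fitted value iteration update with a neural network value function; the architecture, the sizes, and the assumption that we can evaluate every action's outcome in each sampled state are illustrative, not from the lecture:

import torch
import torch.nn as nn

obs_dim, n_actions, gamma = 4, 2, 0.99   # illustrative sizes
value_net = nn.Sequential(nn.Linear(obs_dim, 64), nn.Tanh(), nn.Linear(64, 1))
optimizer = torch.optim.Adam(value_net.parameters(), lr=1e-3)

def fitted_value_iteration_step(states, rewards, next_states):
    # states: (N, obs_dim); rewards: (N, n_actions); next_states: (N, n_actions, obs_dim).
    # Needs the outcome of every action in every sampled state, i.e. known dynamics.
    with torch.no_grad():
        next_v = value_net(next_states).squeeze(-1)              # V(s') for each candidate action
        target = (rewards + gamma * next_v).max(dim=1).values    # max_a [ r + gamma * V(s') ]
    v = value_net(states).squeeze(-1)
    loss = ((v - target) ** 2).mean()                            # regress V(s) onto the target
    optimizer.zero_grad(); loss.backward(); optimizer.step()
    return loss.item()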

What if we don't know the transition dynamics? The max in the backup needs to know the outcomes of different actions! Go back to policy iteration, but evaluate the Q-function of the current policy instead, since we can fit that using samples.

Can we do the max trick again? Forget the policy, compute the value directly. Can we do this with Q-values also, without knowing the transitions? Taking the max over Q-values doesn't require simulating actions!
+ works even for off-policy samples (unlike actor-critic)
+ only one network, no high-variance policy gradient
- no convergence guarantees for non-linear function approximation (more on this later)

Fitted Q-iteration

Fitted Q-iteration: why is this algorithm off-policy? It works from a dataset of transitions (s, a, s', r), and the target uses the max over actions at s', so it does not matter which policy collected the data.
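
A hedged sketch of one outer iteration of fitted Q-iteration from a fixed dataset of transitions; the network shape, sizes, and variable names are illustrative assumptions:

import torch
import torch.nn as nn

obs_dim, n_actions, gamma = 4, 2, 0.99   # illustrative sizes
q_net = nn.Sequential(nn.Linear(obs_dim, 64), nn.Tanh(), nn.Linear(64, n_actions))
optimizer = torch.optim.Adam(q_net.parameters(), lr=1e-3)

def fitted_q_iteration_step(s, a, r, s_next, done, inner_grad_steps=50):
    # One outer iteration: compute targets once, then regress Q(s, a) onto them.
    # s: (N, obs_dim); a: (N,) long; r, done: (N,); s_next: (N, obs_dim).
    # Off-policy: the target max_a' Q(s', a') ignores who collected the data.
    with torch.no_grad():
        target = r + gamma * (1.0 - done) * q_net(s_next).max(dim=1).values
    for _ in range(inner_grad_steps):
        q_sa = q_net(s).gather(1, a.unsqueeze(1)).squeeze(1)     # Q(s, a) for the taken actions
        loss = ((q_sa - target) ** 2).mean()
        optimizer.zero_grad(); loss.backward(); optimizer.step()
    return loss.item()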

What is fitted Q-iteration optimizing? Most guarantees are lost when we leave the tabular case (e.g., when we use neural network function approximation).

Online Q-learning algorithms: the online analogue of fitted Q-iteration. [RL loop diagram] The sampling step is off-policy, so there are many choices for how to act here!
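
A minimal sketch of a single online Q-learning update on one freshly collected transition; it reuses the illustrative q_net, optimizer, and gamma from the fitted Q-iteration sketch above:

def online_q_learning_step(s, a, r, s_next, done):
    # One transition, one gradient step (tensors with a leading batch dimension of 1).
    with torch.no_grad():
        target = r + gamma * (1.0 - done) * q_net(s_next).max(dim=1).values
    q_sa = q_net(s).gather(1, a.unsqueeze(1)).squeeze(1)
    loss = ((q_sa - target) ** 2).mean()
    optimizer.zero_grad(); loss.backward(); optimizer.step()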

Exploration with Q-learning. The final policy is greedy with respect to Q, but why is acting greedily a bad idea for step 1 (generating samples)? Actions that currently look suboptimal are never tried. Two simple alternatives: epsilon-greedy and Boltzmann exploration. We'll discuss exploration in more detail in a later lecture!
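
A hedged sketch of the two exploration rules named above; the epsilon and temperature values are illustrative:

import numpy as np

def epsilon_greedy(q_values, epsilon=0.1):
    # With probability epsilon take a random action, otherwise the greedy one.
    if np.random.rand() < epsilon:
        return int(np.random.randint(len(q_values)))
    return int(np.argmax(q_values))

def boltzmann(q_values, temperature=1.0):
    # Sample actions in proportion to exp(Q(s, a) / temperature).
    logits = np.asarray(q_values, dtype=float) / temperature
    logits -= logits.max()                                   # for numerical stability
    probs = np.exp(logits) / np.exp(logits).sum()
    return int(np.random.choice(len(q_values), p=probs))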

Review
- Value-based methods: don't learn a policy explicitly, just learn a value or Q-function
- If we have a value function, we have a policy
- Fitted Q-iteration: batch-mode, off-policy method
- Q-learning: online analogue of fitted Q-iteration
[RL loop diagram]

Break

Value function learning theory. [Figure: table of value estimates for the gridworld example, used to analyze the value iteration backup.]

Non-tabular value function learning

Non-tabular value function learning. Conclusions:
- value iteration converges (tabular case)
- fitted value iteration does not converge
  - not in general
  - often not in practice
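
The reasoning behind these conclusions, written out as a hedged sketch in the standard operator notation (it is summarized on the review slide at the end):

% Bellman backup B and projection Pi onto the function class Omega (in the L2 sense):
(\mathcal{B}V)(s) = \max_a \left[ r(s,a) + \gamma \, \mathbb{E}_{s' \sim p(s' \mid s,a)}[V(s')] \right],
\qquad
\Pi V = \arg\min_{V' \in \Omega} \lVert V' - V \rVert^2

% B is a contraction in the infinity norm, Pi is a contraction in the L2 norm:
\lVert \mathcal{B}V - \mathcal{B}\bar{V} \rVert_\infty \le \gamma \lVert V - \bar{V} \rVert_\infty,
\qquad
\lVert \Pi V - \Pi \bar{V} \rVert_2 \le \lVert V - \bar{V} \rVert_2

% Tabular value iteration is V <- B V, a contraction, so it converges.
% Fitted value iteration is V <- Pi B V; the composition Pi B is not a contraction
% in any single norm, so convergence is not guaranteed.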

What about fitted Q-iteration? The same argument applies with the Q-function backup, and it applies also to online Q-learning.

But it's just regression! Q-learning is not gradient descent: there is no gradient through the target value.
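
A small sketch of what "no gradient through the target value" means in code; q_net here is the same illustrative network as above, and torch.no_grad (or .detach()) is what makes the target a fixed regression target:

import torch

def q_learning_loss(q_net, s, a, r, s_next, gamma=0.99):
    # The target is computed without gradients: it is treated as a constant.
    with torch.no_grad():
        target = r + gamma * q_net(s_next).max(dim=1).values
    q_sa = q_net(s).gather(1, a.unsqueeze(1)).squeeze(1)
    return ((q_sa - target) ** 2).mean()
# Backpropagating through the target as well would give a true gradient method on a
# well-defined (residual-style) objective, but that is not what Q-learning does.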

A sad corollary
An aside regarding terminology

Review
- Value iteration theory: linear operator for the backup, linear operator for the projection; the backup is a contraction, so value iteration converges
- Convergence with function approximation: the projection is also a contraction, but projection + backup is not a contraction, so fitted value iteration does not in general converge
- Implications for Q-learning: Q-learning, fitted Q-iteration, etc. do not converge with function approximation
- But we can make it work in practice! Sometimes. Tune in next time.
[RL loop diagram]