CS 287: Advanced Robotics, Fall 2009, Lecture 14: Reinforcement Learning with Function Approximation and TD Gammon case study

CS 287: Advanced Robotics, Fall 2009
Lecture 14: Reinforcement Learning with Function Approximation and TD Gammon case study
Pieter Abbeel, UC Berkeley EECS

Assignment #1
Roll-out: nice example paper: X. Yan, P. Diaconis, P. Rusmevichientong, and B. Van Roy, "Solitaire: Man Versus Machine," Advances in Neural Information Processing Systems 17, MIT Press, 2005.

Recap: RL so far
When a model is available: VI, PI, GPI, LP.
When a model is not available:
- Model-based RL: collect data, estimate the model, then run one of the above methods on the estimated model.
- Model-free RL: learn V, Q directly from experience: TD(λ), sarsa(λ) perform on-policy updates; Q-learning performs off-policy updates.
What about large MDPs for which we cannot represent all states in memory or cannot collect experience from all states? Function approximation.

Generalization and function approximation
Represent the value function using a parameterized function V_\theta(s), e.g.:
- Neural network: \theta is a vector with the weights on the connections in the network.
- Linear function approximator: V_\theta(s) = \theta^\top \phi(s), with features such as:
  - Radial basis functions: \phi_i(s) = \exp\left( -\tfrac{1}{2} (s - s_i)^\top \Sigma^{-1} (s - s_i) \right)
  - Tilings: (often multiple) partitions of the state space
  - Polynomials: \phi_i(s) = x_1^{j_{i,1}} x_2^{j_{i,2}} \cdots x_n^{j_{i,n}}
  - Fourier basis: \{ 1, \sin(2\pi x_1 / L_1), \cos(2\pi x_1 / L_1), \sin(2\pi x_2 / L_2), \cos(2\pi x_2 / L_2), \ldots \}
[Note: most algorithms are kernelizable.]
Features are often also specifically designed for the problem at hand.
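To make the linear-architecture case concrete, here is a minimal sketch (not from the lecture) of a linear value function with radial basis function features; the state dimension, the centers, and the bandwidth matrix are made-up illustration values.

import numpy as np

# Linear value function V_theta(s) = theta^T phi(s) with RBF features
# phi_i(s) = exp(-1/2 (s - s_i)^T Sigma^{-1} (s - s_i)).
centers = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])  # RBF centers s_i (assumed)
Sigma_inv = np.linalg.inv(0.25 * np.eye(2))                           # shared bandwidth (assumed)

def phi(s):
    """Feature vector: one RBF per center, plus a constant feature."""
    diffs = centers - s
    rbf = np.exp(-0.5 * np.einsum('ij,jk,ik->i', diffs, Sigma_inv, diffs))
    return np.append(rbf, 1.0)

theta = np.zeros(len(centers) + 1)   # parameters, to be learned (e.g., by TD)

def V(s, theta):
    return theta @ phi(s)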

Example: Tetris
state: board configuration + shape of the falling piece; roughly 2^200 states!
action: rotation and translation applied to the falling piece
V(s) = \sum_{i=1}^{22} \theta_i \phi_i(s), with 22 features (aka basis functions) \phi_i:
- Ten basis functions, 0, ..., 9, mapping the state to the height h[k] of each of the ten columns.
- Nine basis functions, 10, ..., 18, each mapping the state to the absolute difference between heights of successive columns: |h[k+1] - h[k]|, k = 1, ..., 9.
- One basis function, 19, that maps the state to the maximum column height: max_k h[k].
- One basis function, 20, that maps the state to the number of holes in the board.
- One basis function, 21, that is equal to 1 in every state.
(A small code sketch of these features appears below.)
[Bertsekas & Ioffe, 1996 (TD); Bertsekas & Tsitsiklis, 1996 (TD); Kakade, 2002 (policy gradient); Farias & Van Roy, 2006 (approximate LP)]

Objective
A standard way to find \theta in supervised learning is to optimize the mean squared error:
\min_\theta \sum_s P(s) ( V(s) - V_\theta(s) )^2
Is this the correct objective?
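As an illustration of how such hand-designed basis functions look in code, here is a minimal sketch (not from the lecture) of the 22 Tetris features for a 10-column board; the binary-array board encoding and the function name are assumptions made for the example.

import numpy as np

def tetris_features(board):
    """22 basis functions for a Tetris board.

    board: binary array of shape (rows, 10); board[r, c] = 1 if cell (r, c) is filled.
    Row 0 is taken to be the top of the board (assumed convention).
    """
    rows, cols = board.shape
    heights = np.zeros(cols)
    holes = 0
    for c in range(cols):
        filled = np.flatnonzero(board[:, c])
        if filled.size > 0:
            top = filled[0]
            heights[c] = rows - top                    # column height h[k]
            holes += np.sum(board[top:, c] == 0)       # empty cells below the top filled cell
    height_diffs = np.abs(np.diff(heights))            # |h[k+1] - h[k]|, k = 1, ..., 9
    return np.concatenate([heights,                    # features 0..9
                           height_diffs,               # features 10..18
                           [heights.max(),             # feature 19: max column height
                            holes,                     # feature 20: number of holes
                            1.0]])                     # feature 21: constant

# Linear value estimate: theta = np.zeros(22); V = theta @ tetris_features(board)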

Evaluating the objective
When performing policy evaluation, we can obtain samples by simply executing the policy and recording the empirical (discounted) sum of rewards:
\min_\theta \sum_{\text{encountered states } s} ( \hat{V}^\pi(s) - V^\pi_\theta(s) )^2
TD methods: use v_t = r_t + \gamma V^\pi_\theta(s_{t+1}) as a substitute for V^\pi(s).

Stochastic gradient descent
Stochastic gradient descent to optimize the MSE objective: iterate
- Draw a state s according to P.
- Update: \theta \leftarrow \theta - \tfrac{1}{2} \alpha \nabla_\theta ( V^\pi(s) - V^\pi_\theta(s) )^2 = \theta + \alpha ( V^\pi(s) - V^\pi_\theta(s) ) \nabla_\theta V^\pi_\theta(s)
TD(0): use v_t = r_t + \gamma V^\pi_\theta(s_{t+1}) as a substitute for V^\pi(s):
\theta_{t+1} \leftarrow \theta_t + \alpha \left( R(s_t, a_t, s_{t+1}) + \gamma V^\pi_{\theta_t}(s_{t+1}) - V^\pi_{\theta_t}(s_t) \right) \nabla_\theta V^\pi_{\theta_t}(s_t)
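A minimal sketch (not from the lecture) of the TD(0) update with a linear approximator, where \nabla_\theta V_\theta(s) = \phi(s); the env/policy/phi interface is hypothetical.

import numpy as np

def td0_linear(env, policy, phi, n_features, alpha=0.05, gamma=0.95, episodes=100):
    """TD(0) policy evaluation with a linear value function V_theta(s) = theta^T phi(s).

    Assumed interfaces: env.reset() -> s, env.step(a) -> (s_next, r, done),
    policy(s) -> a, phi(s) -> feature vector of length n_features.
    """
    theta = np.zeros(n_features)
    for _ in range(episodes):
        s = env.reset()
        done = False
        while not done:
            a = policy(s)
            s_next, r, done = env.step(a)
            v = theta @ phi(s)
            v_next = 0.0 if done else theta @ phi(s_next)
            delta = r + gamma * v_next - v       # TD error
            theta += alpha * delta * phi(s)      # gradient of V_theta(s) is phi(s) for linear FA
            s = s_next
    return theta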

TD(λ) with function approximation
Time t:
\theta_{t+1} \leftarrow \theta_t + \alpha \underbrace{\left( R(s_t, a_t, s_{t+1}) + \gamma V^\pi_{\theta_t}(s_{t+1}) - V^\pi_{\theta_t}(s_t) \right)}_{\delta_t} \underbrace{\nabla_\theta V^\pi_{\theta_t}(s_t)}_{e_t}
Time t+1:
\theta_{t+2} \leftarrow \theta_{t+1} + \alpha \underbrace{\left( R(s_{t+1}, a_{t+1}, s_{t+2}) + \gamma V^\pi_{\theta_{t+1}}(s_{t+2}) - V^\pi_{\theta_{t+1}}(s_{t+1}) \right)}_{\delta_{t+1}} \nabla_\theta V^\pi_{\theta_{t+1}}(s_{t+1})
\theta_{t+2} \leftarrow \theta_{t+2} + \alpha \delta_{t+1} \gamma \lambda e_t   [improving the previous update]
Combined:
\delta_{t+1} = R(s_{t+1}, a_{t+1}, s_{t+2}) + \gamma V^\pi_{\theta_{t+1}}(s_{t+2}) - V^\pi_{\theta_{t+1}}(s_{t+1})
e_{t+1} = \gamma \lambda e_t + \nabla_\theta V^\pi_{\theta_{t+1}}(s_{t+1})
\theta_{t+2} = \theta_{t+1} + \alpha \delta_{t+1} e_{t+1}
Sarsa(λ) and Q(λ) can be adapted similarly, using eligibility vectors with function approximation.
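A minimal sketch (not from the lecture) of the combined TD(λ) update above with a linear approximator; as before, the env/policy/phi interface is hypothetical and \nabla_\theta V_\theta(s) = \phi(s).

import numpy as np

def td_lambda_linear(env, policy, phi, n_features, alpha=0.05, gamma=0.95, lam=0.7, episodes=100):
    """TD(lambda) policy evaluation with eligibility vector e and linear V_theta(s) = theta^T phi(s)."""
    theta = np.zeros(n_features)
    for _ in range(episodes):
        s = env.reset()
        e = np.zeros(n_features)                          # eligibility vector e_t
        done = False
        while not done:
            a = policy(s)
            s_next, r, done = env.step(a)
            v_next = 0.0 if done else theta @ phi(s_next)
            delta = r + gamma * v_next - theta @ phi(s)   # TD error delta_t
            e = gamma * lam * e + phi(s)                  # e_t = gamma*lambda*e_{t-1} + grad V_theta(s_t)
            theta += alpha * delta * e                    # theta update with eligibility vector
            s = s_next
    return theta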

Guarantees
Monte Carlo based evaluation:
- Provides unbiased estimates under the current policy.
- Will converge to the true value of the current policy.
Temporal difference based evaluation:
- TD(λ) with linear function approximation [Tsitsiklis and Van Roy, 1997]: ***if*** samples are generated from traces of execution of the policy π, and for an appropriate choice of step-sizes α, TD(λ) converges, and at convergence:
\| V_\theta - V \|_D \le \frac{1 - \lambda\gamma}{1 - \gamma} \| \Pi_D V - V \|_D
[D = expected discounted state visitation frequencies under policy π]
- Sarsa(λ) with linear function approximation: same as TD.
- Q-learning with linear function approximation [Melo and Ribeiro, 2007]: convergence to a reasonable Q value under certain assumptions, including \forall s, a: \| \phi(s, a) \|_1 \le 1.
[One could also use infinity-norm contraction function approximators to attain convergence --- see earlier lectures. However, this class of function approximators tends to be more restrictive.]

Off-policy counterexamples
Baird's counterexample for off-policy updates.

Off-policy counterexamples
Tsitsiklis and Van Roy counterexample: complete backup with off-policy linear regression [i.e., uniform least squares, rather than weighted by state visitation rates].

Intuition behind TD(0) with linear function approximation guarantees
TD(0) is a stochastic approximation of the following operations:
Back-up:
(T^\pi V)(s) = \sum_{s'} T(s, \pi(s), s') \left[ R(s, \pi(s), s') + \gamma V(s') \right]
Weighted linear regression:
\min_\theta \sum_s D(s) \left( (T^\pi V)(s) - \theta^\top \phi(s) \right)^2
with solution:
\Phi \theta = \underbrace{\Phi (\Phi^\top D \Phi)^{-1} \Phi^\top D}_{\Pi_D} (T^\pi V)
Key observations:
\forall V_1, V_2: \| T^\pi V_1 - T^\pi V_2 \|_D \le \gamma \| V_1 - V_2 \|_D, where \| x \|_D = \sqrt{ \sum_i D(i) x(i)^2 }
\forall V_1, V_2: \| \Pi_D V_1 - \Pi_D V_2 \|_D \le \| V_1 - V_2 \|_D
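A minimal numerical sketch (not from the lecture) of iterating the projected backup \Pi_D T^\pi on a small made-up Markov chain; P, R, \Phi, and D are illustration values, with D chosen as the on-policy (stationary) state distribution so the contraction argument above applies.

import numpy as np

# Made-up 3-state Markov chain under a fixed policy pi (illustration values only).
P = np.array([[0.5, 0.5, 0.0],
              [0.0, 0.5, 0.5],
              [0.5, 0.0, 0.5]])            # P[s, s'] = T(s, pi(s), s')
R = np.array([0.0, 1.0, 2.0])              # expected one-step reward from each state (simplification)
gamma = 0.9
Phi = np.array([[1.0, 0.0],
                [1.0, 1.0],
                [1.0, 2.0]])               # rows are feature vectors phi(s)
D = np.diag([1/3, 1/3, 1/3])               # stationary (on-policy) state distribution of P

def projected_backup(theta):
    """One application of Pi_D T^pi to the current linear value estimate Phi @ theta."""
    TV = R + gamma * P @ (Phi @ theta)     # backup: (T^pi V)(s)
    # D-weighted least squares: min_theta sum_s D(s) ((T^pi V)(s) - theta^T phi(s))^2
    return np.linalg.solve(Phi.T @ D @ Phi, Phi.T @ D @ TV)

theta = np.zeros(2)
for _ in range(200):
    theta = projected_backup(theta)        # contraction in ||.||_D for on-policy D, so this converges
print("fixed point theta:", theta)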

Intuition behind TD(λ) guarantees
Bellman operator:
(T^\pi J)(s) = \sum_{s'} P(s' \mid s, \pi(s)) \left[ g(s) + \gamma J(s') \right] = E[ g(s) + \gamma J(s') ]
T^\lambda operator:
(T^\lambda J)(s) = (1 - \lambda) \sum_{m=0}^{\infty} \lambda^m E\left[ \sum_{k=0}^{m} \gamma^k g(s_k) + \gamma^{m+1} J(s_{m+1}) \right]
The T^\lambda operator is a contraction w.r.t. \| \cdot \|_D for all \lambda \in [0, 1].

Should we use TD rather than Monte Carlo?
At convergence:
\| V_\theta - V \|_D \le \frac{1 - \lambda\gamma}{1 - \gamma} \| \Pi_D V - V \|_D
The bound is tightest for \lambda = 1 (Monte Carlo) and loosest for \lambda = 0, yet empirically intermediate values of \lambda often perform best (see the comparison below).
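To connect the T^\lambda operator to the TD-versus-Monte-Carlo question, here is a short worked check of the two limiting cases, following directly from the definition above:

% lambda = 0: only the m = 0 term survives, recovering the one-step Bellman backup
(T^{0} J)(s) = E[\, g(s_0) + \gamma J(s_1) \,] = (T^{\pi} J)(s)

% lambda -> 1: the weights (1 - \lambda)\lambda^m shift to ever longer backups; since gamma < 1
% the bootstrap term gamma^{m+1} J(s_{m+1}) vanishes, recovering the Monte Carlo return
\lim_{\lambda \to 1} (T^{\lambda} J)(s) = E\Big[ \sum_{k=0}^{\infty} \gamma^{k} g(s_k) \Big]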

Empirical comparison
(See Sutton & Barto, p. 221, for details.)

Backgammon
Each player has 15 pieces and tries to get them to the other side.
Move according to the roll of the dice.
If you hit an opponent piece, it gets reset to the middle row.
You cannot move onto points occupied by two or more opponent pieces.

Backgammon
30 pieces, 24 + 2 possible locations.
For a typical state and dice roll: often around 20 legal moves.

TD Gammon [Tesauro '92, '94, '95]
Reward: 1 for winning the game, 0 in all other states.
Function approximator: 2-layer neural network.

Input features
For each point on the backgammon board, 4 input units indicate the number of white pieces as follows:
- 1 piece: unit1 = 1
- 2 pieces: unit1 = 1, unit2 = 1
- 3 pieces: unit1 = 1, unit2 = 1, unit3 = 1
- n > 3 pieces: unit1 = 1, unit2 = 1, unit3 = 1, unit4 = (n - 3)/2
Similarly for black. [This already makes for 2 * 4 * 24 = 192 input units.]
The last six: number of pieces on the bar (white/black), number of pieces that have completed the game (white/black), white's move, black's move.

Neural net
Inputs \phi(i), i = 1, ..., 198.
Each hidden unit computes: h(j) = \sigma\left( \sum_i w_{ij} \phi(i) \right) = \frac{1}{1 + \exp\left( -\sum_i w_{ij} \phi(i) \right)}
The output unit computes: o = \sigma\left( \sum_j w_j h(j) \right) = \frac{1}{1 + \exp\left( -\sum_j w_j h(j) \right)}
Overall: o = f(\phi(1), ..., \phi(198); w)
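A minimal sketch (not Tesauro's code) of the feature encoding and the two-layer sigmoid network described above; the raw-position input format (piece counts per point plus bar/off counts and side to move) and the function names are assumptions made for the example.

import numpy as np

def encode_point(n):
    """4-unit truncated unary encoding of the number of pieces of one color on a point."""
    return np.array([float(n >= 1), float(n >= 2), float(n >= 3),
                     (n - 3) / 2.0 if n > 3 else 0.0])

def encode_state(white, black, bar_w, bar_b, off_w, off_b, white_to_move):
    """198 input units; white[i], black[i] = pieces of each color on point i (i = 0..23)."""
    units = [encode_point(white[i]) for i in range(24)]      # 96 white units
    units += [encode_point(black[i]) for i in range(24)]     # 96 black units
    extras = [bar_w, bar_b, off_w, off_b,
              1.0 if white_to_move else 0.0,
              0.0 if white_to_move else 1.0]                 # last 6 units
    return np.concatenate(units + [np.array(extras)])        # 192 + 6 = 198 inputs

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def value(features, W_hidden, w_out):
    """2-layer net: o = sigma(w_out . sigma(W_hidden @ phi)), an estimate of the win probability."""
    h = sigmoid(W_hidden @ features)     # hidden units h(j)
    return sigmoid(w_out @ h)            # output unit o

# Example shapes with 40 hidden units (an illustration value; the slides mention 160 for TD Gammon 3.0):
# W_hidden = 0.1 * np.random.randn(40, 198); w_out = 0.1 * np.random.randn(40)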

Neural nets
Popular at that time for function approximation / learning in general.
Derivatives/gradients are easily derived analytically; it turns out they can be computed through backward error propagation --- hence the name error backpropagation.
Susceptible to local optima!

Learning
Initialize the weights randomly.
TD(λ) [λ = 0.7, α = 0.1].
Source of games: self-play, greedy w.r.t. the current value function. [Results suggest the game has enough stochasticity built in for exploration purposes.]
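A minimal sketch (not Tesauro's implementation) of the self-play TD(λ) training loop described above; the value/grad functions (e.g., the network from the previous sketch with backpropagated gradients, weights flattened into one vector) and the game interface are assumptions.

import numpy as np

def train_self_play(game, value, grad, w0, alpha=0.1, lam=0.7, episodes=300_000):
    """Self-play TD(lambda) training sketch.

    value(phi, w) -> scalar estimate (e.g., probability that white wins);
    grad(phi, w)  -> gradient of that output w.r.t. the weight vector w (via backpropagation);
    game          -> assumed interface: new_game, encode, successors, is_over, outcome.
    """
    w = np.array(w0, dtype=float)
    for _ in range(episodes):
        state = game.new_game()
        e = np.zeros_like(w)                                  # eligibility vector
        phi = game.encode(state)
        while not game.is_over(state):
            # Greedy move w.r.t. the current value function; the dice supply the exploration.
            # (A full implementation alternates max/min with the side to move; glossed over here.)
            state = max(game.successors(state), key=lambda s: value(game.encode(s), w))
            phi_next = game.encode(state)
            target = game.outcome(state) if game.is_over(state) else value(phi_next, w)
            e = lam * e + grad(phi, w)                        # gamma = 1 in the episodic game
            w = w + alpha * (target - value(phi, w)) * e      # TD(lambda) weight update
            phi = phi_next
    return w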

Results
After 300,000 games: as good as the best previous computer programs.
Neurogammon: a highly tuned neural network trained on a large corpus of exemplary moves.
TD Gammon 1.0: added the Neurogammon features; substantially better than all previous computer players, at human expert level.
TD Gammon 2.0, 2.1: selective 2-ply search.
TD Gammon 3.0: selective 3-ply search, 160 hidden units.