Function Approximation in Reinforcement Learning
Geoff Gordon, ggordon@cs.cmu.edu
November 5, 1999

Overview
Example (TD-Gammon)
Admission
Why approximate RL is hard
TD(λ)
Fitted value iteration (collocation)
Example (k-NN for hill-car)

Backgammon
Object: move all your pieces off first
Opponent can block you or send you backwards

Backgammon cont'd
30 pieces (15 each for you and your opponent)
24 points + bar + off = 26 locations
2 dice (36 possible rolls)
Rough (over)estimate: 30^26 × 36 × 2 ≈ 1.8 × 10^40 states
Branching factor: 36 rolls × about 20 moves per roll
Complications: gammon, backgammon, doubling cube
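As a quick sanity check on the rough overestimate above (the 30^26 × 36 × 2 reading is reconstructed from the slide), a couple of lines of Python reproduce the order of magnitude; the script itself is only illustrative:

    # Rough overestimate: up to 30 piece-count choices at each of the 26
    # locations, 36 dice rolls, 2 players to move.
    estimate = 30**26 * 36 * 2
    print(f"{estimate:.1e}")   # ~1.8e+40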

TD-Gammon [Tesauro]
Neural network with 1 hidden layer
200 inputs: board position
40-80 hidden units
1 output: win probability
Sigmoids on all hidden and output units (range [0, 1])

Using RL
Win probability is the value function if:
  Reinforcement = 1 on a win, 0 on a loss, 0 while the game is not over
  Discount is γ = 1
So the weights of the network should approximately satisfy the Bellman equation
Warning: the devil is in the details! For now: ignore the details, use TD(0)

Review of TD(0)
Write v(x|w) for the output of the net with inputs x and weights w
Given a transition x, a, r, y:
  Update w := w_old + α (r + γ v(y|w_old) - v(x|w_old)) ∇_w v(x|w) evaluated at w = w_old
Ignores a (assumes it was chosen greedily)
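A minimal sketch of the TD(0) update above, written for a linear approximator v(x|w) = w · phi(x) rather than TD-Gammon's network; the feature map phi, step size alpha, and discount gamma are illustrative placeholders, not from the slides:

    import numpy as np

    def td0_update(w, phi, x, r, y, alpha=0.1, gamma=1.0):
        """One TD(0) step for a transition x --(a, r)--> y; the action a is ignored."""
        v_x, v_y = w @ phi(x), w @ phi(y)
        td_error = r + gamma * v_y - v_x       # r + gamma v(y|w_old) - v(x|w_old)
        return w + alpha * td_error * phi(x)   # grad_w v(x|w) = phi(x) for linear v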

TD(0) is not gradient descent
Max isn't differentiable
Haven't said how to choose transitions (won't be independent)
Step direction is not a full gradient

Bellman error
Write Error(x, r, y | w) = (r + γ v(y|w) - v(x|w))^2
Write Error(x, r, y | w_1, w_2) = (r + γ v(y|w_2) - v(x|w_1))^2
TD uses ∇_w Error(x, r, y | w, w_old) evaluated at w = w_old
GD would use ∇_w Error(x, r, y | w) evaluated at w = w_old
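To make the contrast concrete, here is a sketch of the two step directions for a linear v(x|w) = w · phi(x) (the linear form is an added assumption): TD holds the weights inside v(y|·) fixed, while true gradient descent on the Bellman error differentiates through both terms:

    def td_direction(w, phi, x, r, y, gamma=1.0):
        delta = r + gamma * (w @ phi(y)) - (w @ phi(x))
        return delta * phi(x)                       # descent direction from Error(x,r,y | w, w_old)

    def residual_gradient_direction(w, phi, x, r, y, gamma=1.0):
        delta = r + gamma * (w @ phi(y)) - (w @ phi(x))
        return delta * (phi(x) - gamma * phi(y))    # descent direction from Error(x,r,y | w)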

Representation
Raw board, or raw + expert features
Raw: Are there at least k of my pieces at location x?
     Are there at least k opposing pieces at location x?
     Is it my turn? My opponent's?
For k > k_max, extra pieces are added to the last unit's activation
k_max is 4 for points, 1 for bar and off
All inputs scaled to lie approximately in [0, 1]

A useful trick
Die rolls aren't represented; they are implicit in the lookahead:
  position + die roll → (choose) → position → (random) → position + die roll → ...

Training
Start from zero knowledge
Self-play (up to 1.5 million games)
Data: state, die roll, action, reinforcement, state, ...
Compress to: state, state, ..., state, win/loss

Results
Raw features give a pretty good player (close to the best computer player at the time)
Raw + expert features give a world-class player
Absolute probabilities not so hot (10% error)
Relative probabilities very good

What's easy about RL in backgammon
Randomness gives a strong effective discount and kicks learning out of local minima
Guaranteed to terminate
Perfect model, no hidden state
Self-play allows a huge training set

What's hard about RL in backgammon
Nonlinear function approximator: the usual worry of local minima
Changing data distribution: the changing policy causes a change in the importance of states; the wrong distribution could cause divergence [Gordon, Baird]
Game (minimax instead of min or max): could get stuck (checkers, go); could oscillate forever; complete divergence is also possible [Van Roy]

Huge variety of approximate RL algorithms
Incremental vs. batch
Model-based vs. direct
Kind of function approximation
Control vs. estimation
State-values vs. action-values (V vs. Q) vs. no values (policy only)

One fundamental question
What does "approximate solution to the Bellman equations" mean?
Subquestion: how do we find one?

Possible answers
Bellman equations are a red herring
Minimum squared Bellman error
Minimum weighted squared Bellman error
Satisfy Bellman equations at some states
Other relaxations of Bellman equations

Minimum SBE example
3 × 100 grid of states
Fit a line to each row: v(i, j) = a_j + b_j i

Minimum SBE example cont'd
[Plots of the resulting value functions: Exact, Min SBE, Min weighted SBE]

Flows
Fixed policy, fixed starting frequencies f_0(x)
F(x, a) = expected # of times we visit state x and perform action a
Measures how important (x, a) is to the value of the policy

Flows cont'd
If γ < 1, use discounted flows
  (visit (s, a) at steps 2 and 7: discounted flow is γ^2 + γ^7)
For all γ, F satisfies the conservation equation
  Σ_a F(x, a) = f_0(x) + γ Σ_{y,a} F(y, a) P(x | y, a)
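For a fixed policy the conservation equation is linear in the state flows F(x) = Σ_a F(x, a), so they can be computed directly; a small sketch, where P and f0 are assumed inputs (P[y, x] is the probability of reaching x from y under the policy, f0 the start frequencies):

    import numpy as np

    def discounted_flows(P, f0, gamma):
        """Solve F = f0 + gamma P^T F, the conservation equation summed over actions."""
        n = len(f0)
        return np.linalg.solve(np.eye(n) - gamma * P.T, f0)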

Aside: optimal flows
Can find optimal flows using nonnegativity, conservation, and maximization of reinforcement
This is dual (in the sense of LP) to solving the Bellman equations
F* maximizes the total expected discounted reinforcement Σ_{x,a} E(r(x, a)) F(x, a)
Optimal flows are the optimal policy's flows: F* = F^{π*}
Optimal flows determine the optimal policy
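The LP view above can be written down directly: maximize expected reinforcement over nonnegative flows subject to conservation. A sketch using scipy, with assumed input shapes P[x, a, y] = P(y | x, a), r[x, a] = E(r(x, a)), and start frequencies f0:

    import numpy as np
    from scipy.optimize import linprog

    def optimal_flows(P, r, f0, gamma):
        n, m, _ = P.shape
        # Conservation: sum_a F(x,a) - gamma sum_{y,a} F(y,a) P(x|y,a) = f0(x)
        A_eq = np.zeros((n, n * m))
        for x in range(n):
            for y in range(n):
                for a in range(m):
                    A_eq[x, y * m + a] = float(x == y) - gamma * P[y, a, x]
        # linprog minimizes and keeps variables nonnegative by default, so negate r
        res = linprog(c=-r.reshape(-1), A_eq=A_eq, b_eq=f0)
        return res.x.reshape(n, m)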

Choosing weights
Optimal flows would be good weights
If there is just one policy, we can find the flows
Otherwise, chicken-and-egg problem
Iterative improvement? Yes, but none known to converge
Research question: EM algorithm?

Matrix form of TD(0)
Assume one policy, n states
Write P for the transition probability matrix (n × n)
  P_ij is the probability of going to state j from state i
  P_i (the ith row of P) is the distribution of the next state given that the current state is i
Write r for the reinforcement vector (n × 1)
  r_i is the expected reinforcement when the current state is i

Matrix form of TD(0) cont'd
Write v for the value function (n × 1); v_i is the value of state i
(γPv + r) - v is the vector of Bellman errors
The Bellman equation is (γP - I)v + r = 0
Write E = γP - I

Matrix form of TD(0) cont'd
Assume v is linear in the parameters w (k × 1), and w = 0 corresponds to v(x) = 0 for all x
∇_w v_i is a constant k-vector (call it A_i)
∇_w v is a constant n × k matrix (call it A), so v = Aw

Expected update
E((r + γ v(y|w) - v(x|w)) ∇_w v(x|w))
  = Σ_x P(x) E((r + γ v(y|w) - v(x|w)) A_x | x)
  = Σ_x P(x) A_x (E(r|x) + γ E(v(y|w) | x) - v(x|w))
  = Σ_x P(x) A_x (r_x + γ E(A_y w | x) - A_x w)
  = Σ_x P(x) A_x (r_x + γ (Σ_y P(y|x) A_y) w - A_x w)

Expected update cont'd
  Σ_x P(x) A_x (r_x + γ (Σ_y P(y|x) A_y) w - A_x w)
  = Σ_x P(x) A_x (r_x + γ (P_x^T A) w - A_x w)
  = Σ_x P(x) A_x (r_x + (γ P_x^T A - A_x) w)
  = Σ_x P(x) A_x (r_x + (γ P_x - U_x)^T A w)
  = (DA)^T (R + (γP - I) A w)
where U_x is the xth unit vector, D is the diagonal matrix of the state distribution P(x), and R is the reinforcement vector
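A few lines of numpy make the final identity easy to evaluate numerically; d here is the state distribution P(x), an assumed input:

    import numpy as np

    def expected_td0_update(A, P, R, d, w, gamma):
        """(D A)^T (R + (gamma P - I) A w)  with D = diag(d)."""
        D = np.diag(d)
        E = gamma * P - np.eye(len(d))
        return (D @ A).T @ (R + E @ A @ w)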

TD(λ)
Data: x_1, r_1, x_2, r_2, ...
TD(0) uses the 1-step error (r_1 + γ v(x_2)) - v(x_1)
Could use the 2-step error (r_1 + γ r_2 + γ^2 v(x_3)) - v(x_1)
TD(λ) weights the k-step error proportionally to λ^k
As λ → 1, approaches supervised learning
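A sketch of blending the k-step returns for a finite trajectory, following the "proportional to λ^k" weighting stated above; normalizing the weights is an added assumption (the standard λ-return uses weights (1 - λ) λ^(k-1)):

    def lambda_return(xs, rs, v, gamma, lam):
        """Blend of k-step returns from xs[0]; xs = [x_1, ..., x_T], rs = [r_1, ..., r_{T-1}]."""
        returns, weights = [], []
        for k in range(1, len(xs)):
            ret = sum(gamma**i * rs[i] for i in range(k)) + gamma**k * v(xs[k])
            returns.append(ret)
            weights.append(lam**k)
        total = sum(weights)
        if total == 0:                      # lam == 0: fall back to the 1-step return
            return returns[0]
        return sum(w * g for w, g in zip(weights, returns)) / total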

Convergence
Expected update is A^T D (E A w + R)
Can write w := (I + α A^T D E A) w_old + α k + error, where k = A^T D R is constant
Sutton (1988) proved that A^T D E A is negative definite (equivalently, -A^T D E A is positive definite)
So there exists a norm in which (I + α A^T D E A) is a contraction
So (I + α A^T D E A) has spectral radius < 1 for small enough α
Convergence of the expectation follows; a tricky stochastic argument shows that the random perturbations don't kill convergence [Dayan, Jaakola, Jordan, Singh, Tsitsiklis, Van Roy]

TD(0) and min weighted SBE
TD(0) does not find the minimum of the weighted SBE, but comes close
TD(0) solves A^T D (E A w + R) = 0
Min WSBE satisfies (DEA)^T D (E A w + R) = 0
Note: solving the TD equations in batch mode is called LS-TD (for least squares, even though it's not minimizing the square of anything) [Bradtke & Barto]
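Batch LS-TD as described above is just a k × k linear solve; a sketch under the same matrix assumptions (A, P, R, and state weights d):

    import numpy as np

    def lstd_weights(A, P, R, d, gamma):
        """Solve A^T D (E A w + R) = 0 with E = gamma P - I and D = diag(d)."""
        D = np.diag(d)
        E = gamma * P - np.eye(len(d))
        return np.linalg.solve(A.T @ D @ E @ A, -(A.T @ D @ R))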

Collocation
Satisfy the Bellman equations exactly at m states
If there are k parameters, might expect m = k
But the Bellman equations are nonlinear, so in general:
  might need more or fewer than k equations
  the solution might not be unique
  hard to find

Averagers
For some function approximators [Gordon, Van Roy]:
  m = k works
  the solution is unique
  easy to find
Called averagers
Examples: k-nearest-neighbor, linear interpolation on a mesh, multilinear interpolation on a grid

Fitted value iteration
Pick a sample X
Remember v(x) only for x ∈ X
To find v(x) for x ∉ X, use k-nearest-neighbor, linear interpolation, ...
One step of fitted VI has two parts:
  Do a step of exact VI
  Replace the result with a cheap approximation
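A minimal sketch of fitted value iteration with a 1-nearest-neighbor averager; the samples, actions, and deterministic model(x, a) -> (reward, next_state) interface are illustrative assumptions, not from the slides:

    import numpy as np

    def fitted_value_iteration(samples, actions, model, gamma, n_iters=50):
        v = np.zeros(len(samples))            # values remembered only at the sample states

        def v_hat(x):                         # cheap approximation for states off the sample
            dists = [np.linalg.norm(np.asarray(x) - np.asarray(s)) for s in samples]
            return v[int(np.argmin(dists))]   # 1-nearest-neighbor (an averager)

        for _ in range(n_iters):
            new_v = np.empty_like(v)
            for i, x in enumerate(samples):   # a step of exact VI at each sample state...
                backups = []
                for a in actions:
                    r, y = model(x, a)
                    backups.append(r + gamma * v_hat(y))
                new_v[i] = max(backups)
            v = new_v                         # ...then replace the result with the cheap fit
        return v, v_hat

Because 1-nearest-neighbor is an averager (a max-norm non-expansion), this fit cannot exaggerate in the sense of the later slides.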

Approximators as mappings
MDP with 2 states, so the value function is 2D: (v(x), v(y))
Approximate with v(x) = v(y) = w
[Figure: the target value function (1, 5) maps to the fitted function (3, 3) on the line v(x) = v(y)]

Approximators as mappings cont'd
[Figure: two target value functions and their images under the fitting map M_A]

Exaggeration
[Figure: two similar target value functions and their images under M_A]
Two similar target value functions
Larger difference between the fitted functions
Exaggeration = max-norm expansion

Hill-car
Car tries to drive up the hill, but the engine is weak, so it must back up to get momentum
[Figure: hill with Goal at the top; Forward and Reverse thrust; Gravity]

Results
[Plots: approximate and exact value functions for hill-car]

Results the hard way
[Plot: hill-car value function]

Interesting topics left uncovered
Adaptive refinement (use flows, policy uncertainty, and value uncertainty to decide where to split) [Munos & Moore]
Hidden state (similar problems to function approximation)
Eligibility traces
Policy iteration and λ-PI
Actor-critic architectures
Duality and RL
Stopping problems

Bonus extra slide: RL counterexample
Start with values (0, 0, 0)
Do one backup: (0, 1, 1)
Fit a line: (1/6, 2/3, 7/6)
Do another backup: (0, 1 + 7γ/6, 1 + 7γ/6)
For γ > 6/7, divergence (each sweep scales the fitted value at the third state by 7γ/6)
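A small check of the iteration implied by the numbers above (assuming the same least-squares line fit as in the minimum-SBE example): each backup-then-fit sweep maps the fitted value at the third state through c → 7(1 + γc)/6, so it blows up exactly when 7γ/6 > 1:

    def counterexample(gamma, sweeps=30):
        c = 0.0                                # fitted value at the third state
        for _ in range(sweeps):
            c = 7 * (1 + gamma * c) / 6        # backup, then refit the line
        return c

    print(counterexample(0.80))   # gamma < 6/7: settles toward 7 / (6 - 7 * 0.8) = 17.5
    print(counterexample(0.95))   # gamma > 6/7: grows without bound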