2D1431 Machine Learning Lab 3: Reinforcement Learning

Frank Hoffmann, modified by Örjan Ekeberg

December 7, 2004

1 Introduction

In this lab you will learn about dynamic programming and reinforcement learning. It is assumed that you are familiar with the basic concepts of reinforcement learning and that you have read chapter 13 in the course book Machine Learning (Mitchell, 1997). The first four chapters of the survey on reinforcement learning by Kaelbling et al. (1996) are good supplementary material. For further reading and a detailed discussion of policy iteration and reinforcement learning, the textbook Reinforcement Learning (Sutton and Barto, 1999) is highly recommended; in particular, studying chapters 3, 4 and 6 is of immense help for this lab. The predefined Matlab functions for this lab are located in the course directory /info/mi04/labs/lab3.

Dynamic programming refers to a class of algorithms that can be used to compute optimal policies given a complete model of the environment. Dynamic programming solves problems that can be formulated as Markov decision processes. Unlike in the reinforcement learning case, dynamic programming assumes that the state transition and reward functions are known. The central idea of dynamic programming and reinforcement learning is to learn value functions, which in turn can be used to identify the optimal policy.

2 Policy Evaluation and Policy Iteration

First we consider policy evaluation, namely how to compute the state-value function V^π for an arbitrary policy π.

For the deterministic case the value function has to obey the Bellman equation

    V^π(s) = r(s, π(s)) + γ V^π(δ(s, π(s)))                                      (1)

where δ(s, a) : S × A → S and r(s, a) : S × A → R are the deterministic state transition and reward functions. This equation can either be solved directly, by solving a linear equation of the type

    V = R + BV                                                                    (2)

where V and R are vectors and B is a matrix. An alternative is to solve equation (1) by successive approximation, considering the Bellman equation as an update rule

    V^π_{k+1}(s) = r(s, π(s)) + γ V^π_k(δ(s, π(s)))                               (3)

The sequence of V^π_k can be shown to converge to V^π as k → ∞. This method is called iterative policy evaluation.

If the policy is stochastic, i.e., the action taken in a given state s is drawn from a probability distribution over possible actions, then we will use π(s, a) to denote the probability of taking action a. The iterative Bellman equation then has the following form:

    V^π_{k+1}(s) = Σ_a π(s, a) (r(s, a) + γ V^π_k(δ(s, a)))                       (4)

For the non-deterministic case, the transition and reward functions have to be replaced by probabilistic functions. In that case the Bellman equation becomes

    V^π(s) = Σ_{s'} P(s'|s, π(s)) (R(s, s', π(s)) + γ V^π(s'))                    (5)

where P(s'|s, a) is the probability that the next state is s' when executing action a in state s, and R(s, s', a) is the reward obtained when executing action a in state s and transitioning to the next state s'. Policy evaluation for the non-deterministic case can be formulated as an update rule similar to equation (3) by

    V^π_{k+1}(s) = Σ_{s'} P(s'|s, π(s)) (R(s, s', π(s)) + γ V^π_k(s'))            (6)
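To make the update in equation (4) concrete, the following minimal Matlab sketch performs iterative policy evaluation for the equiprobable policy on the 4 × 4 grid world used below. It assumes that you have encoded the deterministic model yourself as two hypothetical 16 × 4 lookup tables, delta(s,a) for the successor state and r(s,a) for the reward; these tables are not part of the provided lab code.

% Iterative policy evaluation (equation (4)) for the equiprobable policy.
% delta and r are hypothetical 16x4 lookup tables for the successor state
% and the reward, built by hand from the grid world description.
gamma = 0.9;
V = zeros(16,1);
for k = 1:100                       % fixed number of sweeps
  Vold = V;                         % two-array (synchronous) version
  for s = 1:16
    backup = 0;
    for a = 1:4
      backup = backup + 1/4 * (r(s,a) + gamma*Vold(delta(s,a)));
    end
    V(s) = backup;
  end
end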

Our main motivation for computing the value function for a policy is to improve on our current policy. For some state s we can improve on our current policy by picking an alternative action a ≠ π(s) that deviates from our current policy π(s) if it has a higher action value function Q(s, a) > Q(s, π(s)). This process is called policy improvement. In other words, for each state s we greedily choose the action that maximizes Q^π(s, a):

    π'(s) = argmax_a Q^π(s, a) = argmax_a (r(s, a) + γ V^π(δ(s, a)))              (7)

Once a policy π has been improved using V^π to yield a better policy π', we can then compute V^π' and improve it again to yield an even better π''. Policy iteration intertwines policy evaluation and policy improvement according to

    V^π_{k+1}(s) = max_a Q(s, a) = max_a (r(s, a) + γ V^π_k(δ(s, a)))
    π_{k+1}(s) = argmax_a Q(s, a) = argmax_a (r(s, a) + γ V^π_k(δ(s, a)))         (8)

For the non-deterministic case we obtain

    V^π_{k+1}(s) = max_a Q(s, a) = max_a Σ_{s'} P(s'|s, a) (R(s, s', a) + γ V^π_k(s'))
    π_{k+1}(s) = argmax_a Q(s, a) = argmax_a Σ_{s'} P(s'|s, a) (R(s, s', a) + γ V^π_k(s'))   (9)

It can be shown that policy iteration converges to the optimal policy. Notice that each policy evaluation, itself an iterative computation, is started with the value function for the previous policy.
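A compact way to carry out these updates in Matlab is to collect the action values of one state in a small vector and extract the maximum and its index in a single call. The sketch below follows equation (8) for the deterministic case and reuses the hypothetical model tables delta and r from the previous sketch; it is an outline, not part of the provided lab code.

% Policy iteration according to equation (8) (sketch, deterministic case).
% delta and r are the same hypothetical 16x4 model tables as before.
gamma = 0.9;
V = zeros(16,1);
pi = ones(16,1);
for k = 1:100
  for s = 1:16
    Q = zeros(4,1);
    for a = 1:4
      Q(a) = r(s,a) + gamma*V(delta(s,a));   % action values for state s
    end
    [V(s), pi(s)] = max(Q);                  % greedy value and greedy action
  end
end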

Assume a grid world of 4 × 4 cells that correspond to 16 states enumerated s_1, ..., s_16 as shown in Figure 1. In each state the agent can choose one of the four possible actions (North, West, South, East) in order to move to a neighboring cell. If the agent attempts to move beyond the limits of the grid world, for example going east in state s_8 located at the right edge, it remains in the original cell but incurs a penalty of -1. There are two special cells A (s_1) and B (s_3) from which the agent is beamed to the cells A' (s_13) and B' (s_11) respectively, independent of the action it chooses. When being beamed it receives a reward of +10 for the transition from A to A' and a reward of +5 for the transportation from B to B'. For all other moves that do not attempt to lead outside the grid world the reward is zero. There are no terminal states and the agent tries to maximize its future discounted rewards over an infinite horizon. Assume a discount factor of γ = 0.9. Due to the discount factor the accumulated reward remains finite even though the problem has an infinite horizon.

     A(1)    2    B(3)    4
       5     6      7     8
       9    10   B'(11)  12
    A'(13)  14     15    16

Figure 1: Grid world. Independent of the action taken by the agent in cell A, it is beamed to cell A' and receives a reward of +10. The same applies to B and B' with a reward of +5.

Notice that returning from B' to B only takes a minimum of two steps, whereas going back to A from A' takes at least three steps. Therefore, it is not immediately obvious which policy is optimal.

Assignment 1:

Use iterative policy evaluation to compute the value function V^π(s) for an equiprobable policy in which at each state all four possible actions (including the ones that attempt to cross the boundary of the grid world) have the same uniform probability π(s, a) = 1/4. Assume a discount factor γ = 0.9. Use the update according to the Bellman equation in (4) to approximate the value function. You can use two arrays, one for the old values V^π_k(s) and one for the new values V^π_{k+1}(s). This way the new values can be computed one by one from the old values without the old values being changed. It turns out, however, that it is easier to use in-place (asynchronous) updates, with each new value immediately overwriting the old one. Asynchronous updates also converge to V^π; in fact they usually converge faster than the synchronous two-array version.

As an example we compute the new value of state s_8.

For the four possible actions North, West, South, East the successor states are δ(s_8, North) = s_4, δ(s_8, South) = s_12, δ(s_8, West) = s_7 and δ(s_8, East) = s_8 (the agent attempts to leave the grid world and remains in the same square). The rewards are all zero except for the penalty r(s_8, East) = -1 when taking the East action. All actions are equally likely, therefore π(s_8, North) = π(s_8, South) = π(s_8, West) = π(s_8, East) = 1/4. In Matlab we use a vector of length 16 to store the value function. The update rule for state s_8 would look like:

>> gamma=0.9;
>> V=zeros(16,1);
>> V(8) = 1/4 * (-1 + gamma*(V(4) + V(7) + V(12) + V(8)))

The Matlab function plot_v(V,range,pi) plots the state value function as a color plot. The first argument V is a 16 × 1 vector with the state values V(s_i). The second optional argument range is a 2 × 1 vector that specifies the lower and upper bound of the value function for scaling the color plot. The default range is [-10 30]. The third optional argument pi is a 16 × 1 vector specifying the current policy π(s) : S → A, where by definition the actions North, East, South, West are clockwise enumerated from 1 to 4.

Use policy iteration based on equation (8) to compute the optimal value function V* and policy π*(s, a). It might be easier to use the action value function Q(s, a) rather than the state value function V(s). In Matlab you represent Q(s, a) by a 16 × 4 matrix, where the first dimension corresponds to the state and the second dimension to the action. Visualize the optimal value function and policy using plot_v. After how many iterations does the algorithm find an optimal policy, assuming the initial state values are zero? Is the optimal policy unique? What happens if you initialize the state value function with random values rather than zero?

>> V=10.0*rand(16,1);

Does the algorithm converge to a different policy?
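For the non-deterministic setting of the next assignment, the backup in equation (9) sums over all successor states. The sketch below shows one possible way to organize this in Matlab, assuming you have stored the transition model in a hypothetical 16 × 4 × 16 array P with P(s,a,sp) = P(s'|s, a) and the rewards in a matching array R; neither array is provided by the lab code.

% Policy iteration with the non-deterministic backup of equation (9) (sketch).
% P(s,a,sp) = P(s'|s,a) and R(s,a,sp) are hypothetical model arrays.
gamma = 0.9;
V = zeros(16,1);
pi = ones(16,1);
for k = 1:100
  for s = 1:16
    Q = zeros(4,1);
    for a = 1:4
      for sp = 1:16
        Q(a) = Q(a) + P(s,a,sp) * (R(s,a,sp) + gamma*V(sp));
      end
    end
    [V(s), pi(s)] = max(Q);
  end
end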

Assignment 2:

Assume that the transition function is no longer deterministic, but given by the probability P(s'|s, a). Compute the optimal value function V* and policy π*(s, a) using policy iteration according to equations (9) for the non-deterministic state transition function. Assume that with probability p = 0.7 the agent moves to the correct square as indicated by the desired action, but with probability 1 - p = 0.3 a random action is taken that pushes the agent to a random neighboring square. The random square can coincidentally be the very same cell that was originally preferred by the action. A random action can also be an illegal move that incurs a penalty of -1. Visualize the optimal value function and policy using plot_v. After how many iterations does the algorithm find an optimal policy, assuming the initial state values are zero? Is the optimal policy unique?

3 Temporal Difference Learning

This assignment deals with the general reinforcement learning problem, in which we no longer assume that the state transition and reward functions are known. Temporal difference (TD) methods learn directly from experience and do not rely on a model of the environment's dynamics. TD methods update the estimate of the action value function based on learned estimates; in other words, unlike Monte Carlo methods, which update their estimates only at the end of an episode, they bootstrap and update their beliefs immediately after each state transition. For more details on temporal difference learning read chapters six and seven of the reinforcement learning book by Sutton and Barto (1999).

Temporal difference learning is more easily formulated using the action value function Q(s, a) rather than the state value function V(s), which are related through

    Q^π(s, a) = Σ_{s'} P(s'|s, a) (R(s, s', a) + γ V^π(s'))                       (10)

In contrast to dynamic programming, the agent learns through interaction with the environment. There is a need for active exploration of the state space and the possible actions. At each state s the agent chooses an action a according to its current policy, and observes an immediate reward r and a new state s'. This sequence of state, action, reward, state, action motivates the name SARSA for this form of learning. The action value function can be learned by means of off-policy TD learning, also called Q-learning. In its simplest form, one-step Q-learning, it is defined by the update rule

    Q(s, a) = Q(s, a) + α (r + γ max_{a'} Q(s', a') - Q(s, a))                    (11)
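As a concrete illustration, a single application of this update for one observed transition (s, a, r, s_new) can be written directly in Matlab, assuming Q is an action value matrix with one row per state and one column per action and a learning rate of α = 0.1:

% One-step Q-learning update (equation (11)) for one observed transition.
% Q has one row per state and one column per action; s, a, r, s_new are
% the current state, chosen action, observed reward and next state.
alpha = 0.1; gamma = 0.9;
Q(s,a) = Q(s,a) + alpha*(r + gamma*max(Q(s_new,:)) - Q(s,a));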

In this case, the learned action-value function Q(s, a) directly approximates the optimal value function Q*(s, a), independent of the policy followed, hence off-policy learning. However, the policy π(s, a) : S × A → R (π(s, a) is the probability of taking action a in state s) still has an effect in that it determines which state-action pairs are visited and updated. All temporal difference methods have a need for active exploration, which requires that the agent every now and then tries alternative actions that are not necessarily optimal according to its current estimates of Q(s, a). The policy is generally soft, meaning that π(s, a) > 0 for all states and actions. An ɛ-greedy policy satisfies this requirement, in that most of the time, with probability 1 - ɛ, it picks the optimal action according to

    π(s) = argmax_a Q(s, a)                                                       (12)

but with a small probability ɛ it takes a random action. Therefore, all non-greedy actions are taken with the probability π(s, a) = ɛ/A(s), where A(s) is the number of alternative actions in state s. As the agent collects more and more evidence the policy shifts towards a deterministic optimal policy. This can be achieved by decreasing ɛ with an increasing number of observations, for example according to

    ɛ(t) = ɛ_0 (1 - t/T)                                                          (13)

where T is the total number of iterations. Reasonable values for the learning and exploration rates are α = 0.1 and ɛ_0 = 0.2.

The off-policy TD algorithm can be summarized as

    Initialize Q(s, a) arbitrarily
    Initialize s
    Repeat for each step:
        Choose a from s using an ɛ-greedy policy based on Q(s, a)
        Take action a, observe reward r and next state s'
        Update Q(s, a) = Q(s, a) + α (r + γ max_{a'} Q(s', a') - Q(s, a))
        Replace s with s'
    until T steps
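A minimal Matlab sketch of the ɛ-greedy choice with the decaying exploration rate of equation (13) is shown below; it is only an illustration and assumes that Q, the current state s, the current step t and the total number of steps T are already defined.

% Epsilon-greedy action selection with linearly decaying epsilon (sketch).
% Assumes Q (action value matrix), current state s, step t and horizon T.
epsilon0 = 0.2;
epsilon = epsilon0 * (1 - t/T);     % equation (13)
if rand < epsilon
  a = ceil(4*rand);                 % explore: uniformly random action
else
  [qmax, a] = max(Q(s,:));          % exploit: greedy action w.r.t. Q
end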

Assignment 3:

For an unknown environment the agent is supposed to learn the optimal policy by means of off-policy temporal difference learning. The state space consists of 25 states s_1, ..., s_25, corresponding to a 5 × 5 grid world. In each state the agent has the choice between four possible actions a_1, ..., a_4, which can be associated with the four directions North, East, South, West. However, the transition function is not deterministic, which means the agent sometimes ends up in a non-neighboring square. Assume that the exact model of the environment and the rewards are unknown.

The dynamics of the environment are determined by the Matlab functions s = startstate and [s_new reward] = env(s_old,action). The function startstate returns the initial state. The states s_1, ..., s_25 are represented by the integers 1, ..., 25, and the actions a_1, ..., a_4 are enumerated by 1, ..., 4. The function [s_new reward] = env(s_old,action) computes the next state s_new and the reward reward when executing action action in the current state s_old.

Represent the action value function Q(s, a) by a 25 × 4 matrix Q. Given Q you can compute the optimal policy pi(s) and the state value function V and visualize them with plot_v_td(V,range,pi) using the following code

>> [V pi] = max(Q,[],2);
>> plot_v_td(V,[-5 15],pi);

The function plot_v_td(V,range,pi) is the counterpart to the Matlab function plot_v(V,range,pi) for the 4 × 4 grid world used in the earlier assignments. The function plot_trace(states,actions,tlength) can be used to plot a trace of the most recently visited states. The parameter states is an N × 1 vector that contains the history of recent states s(t), ..., s(t + N), the parameter actions is an N × 1 vector that stores the history of recent actions a(t - 1), ..., a(t + N - 1), and tlength determines how many states from the past are plotted.

Build a history of states, actions and rewards when iterating the TD-learning algorithm, by appending the new state s, action a and reward r to the history of previous states, actions and rewards.

>> for k=1:iterations
>>   ...
>>   states = [states s];
>>   actions = [actions a];
>>   rewards = [rewards r];
>>   ...
>> end
>> plot_trace(states,actions,12);
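Putting the pieces together, one possible skeleton for the learning loop is sketched below. It only relies on the provided functions startstate, env, plot_v_td and plot_trace; the parameter values follow the suggestions in this and the following paragraphs, and the discount factor γ = 0.9 is an assumption carried over from the earlier assignments.

% Off-policy TD (Q-learning) loop for the 5x5 grid world (sketch).
% Uses the provided functions startstate, env, plot_v_td and plot_trace;
% gamma = 0.9 is an assumption carried over from the earlier assignments.
T = 20000; alpha = 0.1; epsilon0 = 0.2; gamma = 0.9;
Q = 0.1*ones(25,4);                  % small positive initial values
s = startstate;
states = []; actions = []; rewards = [];
for t = 1:T
  epsilon = epsilon0 * (1 - t/T);    % decaying exploration rate, eq. (13)
  if rand < epsilon
    a = ceil(4*rand);                % exploratory random action
  else
    [qmax, a] = max(Q(s,:));         % greedy action
  end
  [s_new, r] = env(s, a);            % interact with the environment
  Q(s,a) = Q(s,a) + alpha*(r + gamma*max(Q(s_new,:)) - Q(s,a));
  states = [states s]; actions = [actions a]; rewards = [rewards r];
  s = s_new;
  if mod(t,500) == 0                 % visualize progress every 500 steps
    [V, pi] = max(Q, [], 2);
    plot_v_td(V, [-5 15], pi);
    plot_trace(states, actions, 12);
    avg_reward = mean(rewards(end-499:end))   % average reward, printed
  end
end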

Run the off-policy TD learning algorithm for 20000 steps. Initialize Q(s, a) with small positive values (e.g. 0.1) in order to bias the TD learning to explore alternative actions in the early stages, when most of the time the rewards are zero. Every 500 steps, visualize the current state value function V(s) and optimal policy π(s), plot a trace of the recently visited states and actions, and compute the average reward over the past 500 steps. Plot the evolution of the average and accumulated reward as a function of the number of iterations.

Experiment with different settings for the exploration parameter ɛ_0 and the learning rate α. Can you think of an extension to the one-step TD-learning algorithm that would help to learn the optimal policy in fewer iterations? If you have time, try to implement this extension.

References

L. P. Kaelbling, M. L. Littman, and A. W. Moore. Reinforcement learning: A survey. Journal of Artificial Intelligence Research, 4:237–285, 1996.

T. M. Mitchell. Machine Learning. McGraw Hill, 1997.

R. Sutton and A. Barto. Reinforcement Learning. MIT Press, 1999. Also available online at http://www-anw.cs.umass.edu/~rich/book/the-book.html